
Launch HN: Retell AI (YC W24) – Conversational Speech API for Your LLM

dang
12 replies
4d3h

[stub for offtopicness]

threeseed
1 replies
4d4h

Why is every comment here from an account with no other comments?

dang
0 replies
4d3h

Ugh. Sorry. Probably some of their users found out about this thread.

I'm going to move all of this to an offtopic stub and collapse it.

We tell founders to make sure this doesn't happen (see https://news.ycombinator.com/yli.html) but I probably need to make the message louder. Not everyone understands that the culture of HN doesn't work this way.

productlordtr
1 replies
4d4h

Do you hire?

yanyan_evie
0 replies
4d4h

Thanks for asking. We are not hiring at this stage.

liangludev
1 replies
4d4h

cool

yanyan_evie
0 replies
4d4h

thank you!

langyou
1 replies
4d4h

Amazing, tried the dental front desk from playground. The voice sounds very natural and could hardly tell it's AI-generated.

yanyan_evie
0 replies
4d4h

glad you like it :)

Xavier_L
1 replies
4d4h

Cool!

yanyan_evie
0 replies
4d4h

Thank you!

xiangshu
0 replies
3d15h

really good

369316020
0 replies
4d4h

Very cool

JustinGu
8 replies
3d22h

Wow, this is incredible. I've worked a bit in the conversational LLM space, and one of the hardest problems we struggled with was human-interruption handling. From the demo it seems like you guys have it down. Can't wait to see where this goes :) BTW, I don't think the demo works on mobile; I tried it in Safari on iOS and got no response.

yanyan_evie
5 replies
3d22h

It might ask for permission to use the microphone. If you can't find it, try going to the website's homepage, where you can enter your phone number to receive a call.

JustinGu
0 replies
3d21h

Retell is much stronger at handling human interruptions

gsharma
1 replies
3d21h

It seems to be broken on iOS Safari. I got no response after accepting the microphone prompt.

yanyan_evie
0 replies
3d21h

Thanks for the feedback. We will look into it

JustinGu
0 replies
3d21h

Yep, I gave permission on both my Mac and phone, but I was able to try the demo on my Mac anyway.

staplar
1 replies
3d16h

They most likely have two "agents" working in tandem to listen and speak, and it seems like the listener takes precedence over the speaker agent, but underneath they share the same context window. Programming-wise, it's probably a multithreading-and-channels architecture, depending on the programming language.
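That speculation can be sketched in a few lines. This is purely illustrative (asyncio tasks stand in for threads and channels, and all names are invented); it is a guess at the pattern, not Retell's actual implementation:

```python
import asyncio

class SharedContext:
    """Conversation history shared by both agents."""
    def __init__(self):
        self.turns = []

async def speaker(ctx, text, interrupted):
    """Speak word by word; stop immediately if the listener signals barge-in."""
    for word in text.split():
        if interrupted.is_set():
            return False                 # cut off mid-utterance
        await asyncio.sleep(0)           # stands in for playing one word of audio
    ctx.turns.append(("agent", text))
    return True

async def listener(ctx, interrupted):
    """Stands in for the ASR/VAD side: user speech detected -> barge in."""
    await asyncio.sleep(0)               # user starts talking almost immediately
    interrupted.set()                    # listener takes precedence over speaker
    ctx.turns.append(("user", "<transcribed speech>"))

async def demo():
    ctx = SharedContext()
    interrupted = asyncio.Event()
    finished, _ = await asyncio.gather(
        speaker(ctx, "Our next opening is Tuesday morning at nine", interrupted),
        listener(ctx, interrupted),
    )
    return finished, ctx.turns
```

Running `asyncio.run(demo())` shows the speaker being cut off before finishing its sentence while both tasks write into the same shared context.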

mvkel
0 replies
2d15h

What a cool trick to pull that off!

101008
8 replies
4d

I would feel deceived if I were a customer of any company or office that uses this. If I take the trouble to call by phone, it's because I want to speak with a person. If I wanted to talk to a machine, I would send an email, talk to a chatbot, or even try to communicate with the company through social media. Calling by phone implies that I am investing time and effort, and I expect the same from the other side.

AustinZzx
2 replies
4d

Totally understandable that most people would want to chat with a human agent (I sometimes share the same feeling). However, I do think a major reason for that is that voice bots used to be bad: they could not understand, could not get things done, and felt like a waste of time. With advancements in voice AI and LLMs, I'm confident there will be more use cases where talking to a voice bot is not a bad experience.

vages
1 replies
3d23h

No. LLMs are worse for customer experience than their predecessors: LLMs confabulate, and their language is so smooth that you often need expertise to catch them in it.

People call customer service because they don’t know what to do. It would be better for most customers to talk to a bot that they can catch making a mistake.

Recent example: https://bc.ctvnews.ca/air-canada-s-chatbot-gave-a-b-c-man-th...

AustinZzx
0 replies
3d23h

Yes, I agree there are problems with LLMs (hallucinations, persona, etc.), and that's exciting because it means there's room for improvement and opportunity. I know many people who are working hard in that field, trying to make LLMs converse better.

For example:

- "hallucinations / LLMs confabulate": techniques like RAG can help
- "language is so smooth that you often need expertise to catch them in it": fine-tuning and prompt engineering can help
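As a toy illustration of the RAG point above (word-overlap retrieval stands in for real embedding search, and the documents and prompt wording are made up):

```python
# Before asking the LLM, retrieve the most relevant snippets and put them in
# the prompt, so answers are grounded in real documents rather than confabulated.

DOCS = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Our office is open Monday to Friday, 9am to 5pm.",
    "Bereavement fares are not offered retroactively.",
]

def retrieve(question, docs, k=1):
    """Toy retriever: rank documents by shared words with the question."""
    q = set(question.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_prompt(question, docs):
    context = "\n".join(retrieve(question, docs))
    return ("Answer using ONLY the context below. If the context does not "
            "answer the question, say you don't know.\n"
            f"Context:\n{context}\nQuestion: {question}")

prompt = build_prompt("When are refunds available?", DOCS)
```

The "say you don't know" instruction is the part that directly targets confabulation; real systems pair it with embedding-based retrieval rather than word overlap.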

ywj7931
0 replies
3d23h

Personally speaking, when I called the DMV and was asked to wait for 40 minutes, if an AI can help me solve that problem, I wouldn't mind. But I definitely understand that different people have different expectations.

stevofolife
0 replies
3d14h

I'd much rather talk to an AI bot than wait on the line for a human for 50 minutes.

monkeydust
0 replies
3d23h

It's a good point, and one the bot industry has not really figured out; forget voice bots, consider those annoying ones telecom companies throw up. My immediate reaction when I get a bot is to throw in a bunch of garbage to get routed to a human as fast as possible. When they get better, perhaps I'll change my behaviour.

jp42
0 replies
3d23h

Personally, I think if a bot can get things done, then I won't mind. I just hope these bots don't repeat the same things over and over while failing to get anything done.

aik
0 replies
3d23h

Completely disagree. You’re not making a phone call in most cases for entertainment purposes. If the options are wait in line for 20 minutes or speak to an actually useful bot, I would take the latter in 100% of cases.

niblettc
5 replies
4d4h

This is incredible and terrifying at the same time. Does it support long context? As in, can I voice chat with an instance of an agent, and then later in a different chat refer to items discussed in the previous chat? Can I also type / text with the agent and have it recall items from a previous session?

yanyan_evie
4 replies
4d4h

That's an interesting point! We did consider adding memory to the voice agent; we have use cases like an AI therapy session wanting to recall previous conversations with the patient. Adding the previous chat would be very helpful as well.

Cheer2171
3 replies
4d3h

an AI therapy session

oh no

yanyan_evie
2 replies
4d3h

The use case I recall involves a nonprofit organization focused on suicide prevention. They are hoping for an AI therapy solution capable of listening to patients and picking up the phone when no human is available. This isn't entirely unacceptable, because one of a therapist's roles is to listen to problems, so AI can effectively substitute in that aspect.

toomuchtodo
0 replies
4d1h

You're not wrong, and I agree this is a great use case, but consider calling it crisis response vs a therapist. A therapist is there to help you dig deep, over a long time, crisis response is a tactical mechanism to prevent imminent self harm.

Amazing product, looking forward to working with it.

thirdusername
0 replies
3d11h

If it were only to fill gaps, then that sounds reasonable, but another risk here is that the voice agent picks up the slack and lowers the pressure for staffing and for solving these problems in the first place.

What is worse: that no one is available to listen to you when you're suicidal, or that you matter so little that only a machine will talk to you? I'm sure some people would have an extremely poor reaction to that.

debarshri
5 replies
4d4h

Is there different language support too?

yanyan_evie
4 replies
4d4h

It's definitely on our roadmap. Once the core product, the voice AI part, becomes humanlike enough, we will add multilingual support.

debarshri
3 replies
4d4h

Spanish would be very helpful

yanyan_evie
2 replies
4d4h

Yes, if you don't mind, you could leave your email on the waiting list at the footer of the website. We could keep you posted!

vrc
1 replies
4d3h

I second this request. For a lot of applications, "good enough" in the caller's native language might seem insufficient (say, a dental office with English-speaking employees and predominantly English-speaking customers), but compared with a stilted conversation with a non-native speaker (broken English to English vs. slightly incorrect native language to native language), there might be more tolerance for AI's hiccups. That is, I might get more information from an imperfect conversation in a person's native language than from the two of us struggling through their broken English.

yanyan_evie
0 replies
4d2h

Yes. Great point

zachbee
4 replies
3d22h

What's the difference between your product and Gridspace's? I get the sense that your offering is more developer-focused, but I'm curious if there are any technical differences.

yanyan_evie
2 replies
3d22h

I believe Gridspace is an IVR solution, not one based on LLMs; it's challenging to ask questions that deviate from the initial settings. We're using an LLM to generate responses, which makes the conversation smoother.

evanmacmillan
0 replies
3d3h

Gridspace CEO here. Our virtual agents use many specialized LLMs. We run these models on our own soft switches at pretty big scale. We also build the software that lets contact center operators onboard and manage LLM-based, virtual agents. A lot is at stake with virtual agents in the contact center. Contact centers need virtual agents that sound great, get the facts right and complete service tasks. Some of our customers are very technical, some are not, all deeply care about how AI represents their businesses.

skeptrune
0 replies
3d14h

Holy hell Gridspace is good

samstave
4 replies
4d1h

With respect to alignment, it should be a fundamental requirement that a speech AI is ___REQUIRED___ to honestly inform a human that it's speaking to them as an AI.

Can you please ensure, going forward, that you have the universal "truth", as it were: have your system always identify that it's AI when "prompted" (irrespective of what the app/dev has built, your API should ensure that if the "safeword" is used it shall reveal it's AI).

--

"Trust me, if you ask them if they are a cop, they legally have to tell you they are a cop" (courts rule it's legal for cops to lie to you), etc...

(It should be like those tones on a hold call that remind you you're still on hold, but instead a constant reminder that this bitch is AI.) There should be some root-level escape word to require any use of this tool to connect you to a human. That word used to be "operator", many times, but still...

Maybe if a conversation with an elderly human goes on with too many "huh? I can't hear you" or "I don't understand, can you repeat that" questions, your AI knows it's talking to a non-tech human, and it should re-MIND them that it's just an AI (meaning no sympathy, no emotion, it will not stop until you are dead), etc...

"Guardrails, motherfucker! Do you speak it?"

AustinZzx
3 replies
4d

Good point. Currently, our product does not contain an LLM; we are purely a voice API. The developer brings their own LLM solution and decides what to say. This would be a great guardrail to build in for all sorts of reasons; we'll see how we can suggest our users adopt it.
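One hedged sketch of what such a developer-side guardrail could look like (the trigger phrases, disclosure wording, and function names are all invented for illustration; this is not a Retell feature):

```python
# Before the developer's LLM reply goes to TTS, check whether the caller asked
# if they are talking to an AI; if so, force an honest answer regardless of
# whatever persona the agent was prompted with.

DISCLOSURE = "Just so you know: I'm an AI assistant, not a human."

TRIGGERS = (
    "are you a robot", "are you an ai", "are you human",
    "am i talking to a machine", "is this a real person",
)

def guarded_response(user_utterance, llm_reply):
    """Override the LLM's reply with an honest disclosure when asked."""
    text = user_utterance.lower()
    if any(t in text for t in TRIGGERS):
        return DISCLOSURE
    return llm_reply

reply = guarded_response("Wait, are you a robot?", "Of course not!")
```

A real implementation would use an intent classifier rather than substring matching, but the key design point is the same: the disclosure check sits outside the LLM, so a persona prompt cannot talk its way around it.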

samstave
2 replies
3d23h

May I please understand your architecture: a dev builds an app that pipes to your API, and you spit it back out? If so, ensure that whatever you spit out identifies itself to whomever is listening....

--

Please explain the architecture of how your system works (or link me if I missed it).

----

The shortest and most important law ever written:

"An AI must identify itself as AI when asked by humans."

The 0th Law of Robotics.

------

@austin

-

Cool, so I'm on an important call with [your customer] and your system has an outage? How is this handled? A dropped call?

(I am not being cynical; I'm being someone who is allergic to post-mortems.)

----

EDIT: you need to stop using the term "user" in anything you market or describe. Full stop.

The reason: in the case of your product, the USER is the motherhecker on the phone listening to anything your CUSTOMER is spewing at them VIA your API.

The USER is who is making IRL *>>>DECISIONS<<<* based on what they hear from your system.

Your CUSTOMER is the one from whom you receive money.

THEIR customer is the one they get money from to pay you.

The USER is the end-point human, who doesn't even know you exist.

AustinZzx
0 replies
3d23h

We handle the audio bytes in/out, and also connect to our user's (the developer's) server for responses. We handle the interaction, decide when to listen and when to talk, and send live updates to our users. When a response is needed, we request it from our user's server.

Our homepage https://www.retellai.com/ has a GIF on it that illustrates this point.
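The request/response loop described above might look roughly like this on the developer's side. The event types and field names below are hypothetical stand-ins to show the shape of the exchange, not Retell's actual wire format:

```python
import json

def handle_event(message, llm=lambda transcript: "How can I help you?"):
    """Developer-side handler: consume live updates, answer response requests.

    `llm` stands in for the developer's own LLM call; the default canned
    reply is just for illustration.
    """
    event = json.loads(message)
    if event.get("type") == "update":
        return None            # live transcript update; nothing to send back
    if event.get("type") == "response_required":
        reply = llm(event.get("transcript", []))
        return json.dumps({"type": "response", "content": reply})
    return None                # ignore unknown event types

out = handle_event(json.dumps({"type": "response_required", "transcript": []}))
```

The split keeps latency-critical work (audio, turn-taking) on the voice platform while the developer only answers "what should the agent say next?".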

AustinZzx
0 replies
3d20h

Nice catch on the wording; customer is indeed more accurate than user.

For outage handling: we strive to keep 99.9%+ uptime. In the case of a dropped call on the phone, the agent would hang up; on the web it might have different error handling depending on how the customer handles it.

stuartjohnson12
1 replies
4d4h

Wow, I agree! That was beyond expectations. The only let-down was the AI contradicted itself when I tried layering on conditionals. It was something like this:

"What time works"

"Morning on Tuesday would be best, but I can also do afternoon"

"I'm sorry, I didn't catch what time in the afternoon you wanted"

"No, I said the morning"

"I'm having a hard time hearing you. What time in the morning did you want?"

"10am"

And from there things were fine. It seemed very rigid on picking a time and didn't suggest times when I laid out a range.

yanyan_evie
0 replies
4d3h

Great point! There's some room for the prompt to improve.

yanyan_evie
0 replies
4d4h

glad you like the demo!

gitgud
0 replies
3d13h

Incredible, even has a bit of "human-like" passive-aggression when I was asking dumb questions:

    [Me] what kind of dental equipment do you use?
    [AI] (sigh) we use a variety of reputable brands for our dental equipment, was there anything specific you'd like to know?
It almost sounded like she was rolling her eyes at that question, like I was wasting her time haha

nsokolsky
4 replies
3d21h

The demo is nice but it makes me wonder: why would a company have a fully automated voice line rather than a booking interface? As a customer I'm never happy to call a company to make a reservation. I'd be extra annoyed if an AI picked up and I had to go through the motions of a conversation instead of doing two clicks in a Web UI.

yanyan_evie
2 replies
3d21h

Yes, for booking appointments, a simple interface might do the trick. However, we've seen many excellent use cases of our API that cut repetitive tasks and help companies save money, like AI logistics assistants, pre-surgery data collection, AI tutors, and AI therapists. I believe the future will bring even more voice-interface applications. Imagine not having to navigate complex UIs; you could book a flight or a hotel just by speaking. Also, older people might prefer phone calls over navigating UIs.

kiney
0 replies
3d10h

Booking a flight with a voice interface is 10 steps backwards from web UIs where I can see different options, prices, calendars....

arcastroe
0 replies
3d7h

As a user, I want to talk to LLMs to answer my random day to day questions. I tried the Retell demo and asked it questions that I had earlier asked Alexa and Google Assistant. The Retell experience was a million times better. I wish I could set it as my phone's (Android) default assistant. If you built that, I would pay good money for it!

However, I agree with others. What I don't want to do is talk to LLMs for customer support. For that, I always want to talk to a real person. It would be infuriating for an LLM to pick up when calling a business.

AustinZzx
0 replies
3d21h

I totally get that clicking in Web UI is super convenient in many scenarios, and I think GUI and voice can co-exist and create synergy. Suppose AI voice agent can solve your problem, cater to your needs, and interact like a human. In that case, I believe it would be super helpful in many scenarios (like others mentioned, waiting on line for 40 minutes is a pretty bad experience). There are also new opportunities in voice like AI companion, AI assistant, etc that we see are starting to emerge.

lawrencechen
4 replies
3d12h

Can you buy/remove phone numbers programmatically? I can't find anything in the docs.

lawrencechen
0 replies
3d10h

thanks!

jamesmcintyre
4 replies
4d2h

Until this demo, the most impressive conversational experiences I'd seen were Pi and LiveKit's KITT demo (https://livekit.io/kitt). I don't think KITT was quite as fast in response time as Retell, but it's incredibly impressive for being fully open source and open to any choice of APIs (imagine KITT with the Groq API plus Deepgram's Aura for super low latency).

Retell focusing on all the other weird/unpredictable aspects of human conversation sounds super interesting, and the demo is incredible.

Things are moving so fast, wow.

russ
1 replies
3d16h

But we don't handle interruptions yet, that's some cool stuff @yanyan_evie!

AmuVarma
0 replies
3d8h

There are a lot of good open-source VAD models that are easily configurable and can be integrated in a day or two; check out the Silero VAD model.
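Silero VAD is a small pretrained neural model, so it can't be reproduced here; as a stand-in, the toy below shows the interface a VAD provides (a speech/no-speech decision per audio frame) using a naive energy threshold:

```python
def energy_vad(samples, frame_len=160, threshold=0.01):
    """Return one boolean per frame: True where mean energy exceeds threshold.

    160 samples is 10 ms at 16 kHz. Real VADs like Silero use a learned model
    instead of an energy threshold, but expose the same per-frame decision.
    """
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        flags.append(energy > threshold)
    return flags

silence = [0.0] * 320                 # two frames of silence
speech = [0.5, -0.5] * 160            # two frames of a loud alternating signal
flags = energy_vad(silence + speech)
```

An interruption handler would watch these flags while the agent is speaking and trigger a barge-in when speech frames appear.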

yanyan_evie
0 replies
4d2h

Thanks! Will push harder

gfodor
4 replies
4d

Can you share what you're doing for TTS? Is it a proprietary fully pretrained in-house model, a fine tuned open source one, or a commercially licensed one?

AustinZzx
3 replies
4d

For TTS, we are currently integrating with different providers like Elevenlabs, Openai TTS, etc. We do have plans down the road to train our own TTS model.

gfodor
2 replies
4d

Ah thank you! What's the lowest latency option you have found so far?

AustinZzx
1 replies
3d18h

Deepgram TTS is pretty fast, but they have not publicly launched yet.

gfodor
0 replies
3d15h

ty!

standapart
3 replies
3d14h

I forgot that we're playing the demo games now.

Well I'll see your demo, and raise you another demo

https://x.com/sean_moriarity/status/1760435005119934862

Except that demo was done by a single person, with a day job and no cap table.

yanyan_evie
2 replies
3d14h

Love to see people taking an interest in our demo. But I checked his earlier posts, and he had already built a voice agent a couple of days ago, so I believe he just changed the prompts to create this demo: https://twitter.com/sean_moriarity/status/175895035375034804.... Also, after testing their live demo at https://nero.ngrok.dev/, it appears to still be missing some key pieces, such as low latency and endpoint detection.

reissbaker
1 replies
3d11h

As a bystander, I definitely agree that Retell's demo wins on latency — it felt pretty close to human. The voice also sounds more natural to my ears (not sure why, since Nero is apparently using Eleven Labs for voices, and Eleven's voices usually sound pretty good to me).

Playing around with the Nero demo, it also feels pretty... broken? I never actually managed to make a booking with Nero; it seemed to often not be able to tell I was done speaking, and would just hang after a couple back-and-forths. Still impressive for a one-man demo built quickly, but they don't feel in the same league.

I do wish Retell's pricing was cheaper, though; $6/hr is pretty much the cost of a call center employee in India, and LLMs still perform below the average human on most things. That being said, I imagine the cost of Retell will come down as the tech advances, and LLMs will also improve over time.

seanmor5
0 replies
3d8h

Yeah, the latency in my demo is definitely worse. There also seem to be some issues where it picks up its own audio and keyboard/mouse clicks and tries to transcribe them, which leads to derailed convos.

The unnatural voices happen I think because for some reason the way I stream audio to the browser breaks up the audio from ElevenLabs, so you get almost this “start/stop” sound that makes it sound worse than usual.

There’s a lot of minor details to get right, but it’s a fun problem to try to solve. I wish Retell the best :)

nnf
3 replies
3d23h

This is very interesting. One thing I wondered about the per-minute pricing is how to keep a phone agent like this from being kept on the phone in order to run up a bill for a company using it. It'd be very inexpensive to make many automated calls to an AI bot like the dentist receptionist in the demo, and to just play a recording of someone asking questions designed to keep the bot on the phone.

As a customer of a service like Retell (though of course not specific to Retell itself), how might one go about setting up rules to keep a phone conversation from going on for too long? At 17¢ per minute, a 6-minute call will cost just over $1, or about $10 per hour. Assuming the AI receptionist can take calls outside of business hours (which would be a nice thing for a business to offer), then such a malicious/time-wasting caller could start at closing time (5pm) and continue nonstop until opening time the next day (8am), with that 15 hour span costing the business $150 for billable AI time. If the receptionist is available on weekends (from Friday at 5pm until Monday at 8am), that's a 63-hour stretch of time, or $630. And if the phone system can handle 10 calls in parallel, the dentist could come in Monday morning to an AI receptionist bill of over $6,300 for a single weekend (63 hours × $10 per hour × 10 lines).

This is in no way a reflection on Retell (I think the service is compelling and the usage-based pricing is fair, and with that being the only cost, it's approachable and easy for people to try out). The problem of when to end a call is one I hadn't considered until now. Of course you could waste the time of a human receptionist who is being paid an hourly wage by the business, but that receptionist is going to hang up on you when it becomes clear you're just wasting their time. But an AI bot may not know when to hang up, or may be prevented from doing so by its programming if the human (or recording) on the other end of the line is asking it not to hang up. You could say it shouldn't ever take more than five minutes to book a dentist appointment, but what if the person has questions about the available dental procedures, or what if it's a person who can't hear well or a non-native speaker who has trouble understanding and needs the receptionist to repeat certain things? A human can handle that easily, but it seems difficult to program limits like this in a phone system.
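The back-of-envelope worst case above, written out (using the comment's own rounding of 17¢/min to roughly $10/hour):

```python
# Worst-case weekend bill for a malicious caller keeping AI lines busy:
# Friday 5pm through Monday 8am, at the comment's rounded hourly rate.
per_minute = 0.17
per_hour = round(per_minute * 60)     # comment rounds $10.20/hour to $10
weekend_hours = 63                    # Friday 5pm to Monday 8am
lines = 10                            # parallel calls the system accepts
worst_case = weekend_hours * per_hour * lines
```

This is why usage-billed agents typically need per-call and per-account caps in addition to any in-conversation hang-up logic.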

nsokolsky
1 replies
3d21h

What stops it for regular human-operated phone lines?

nextworddev
0 replies
3d13h

humans can hang up

AustinZzx
0 replies
3d22h

This can be handled with function calling and other LLM features. We support an input signal for closing the call, so you can have a rule-based (timer) system or LLM-based end-call functionality and use that to hang up.
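A sketch of those two hang-up mechanisms combined: a rule-based timer plus an LLM-invoked end-call function. The tool schema follows the common OpenAI function-calling shape, and the names (`end_call`, `should_hang_up`) are illustrative, not Retell's API:

```python
# Tool definition the developer would expose to their LLM, so the model can
# decide to end the call on its own.
END_CALL_TOOL = {
    "type": "function",
    "function": {
        "name": "end_call",
        "description": "Hang up when the conversation is clearly finished "
                       "or the caller is stalling or abusive.",
        "parameters": {"type": "object", "properties": {}},
    },
}

def should_hang_up(elapsed_seconds, llm_called_end_call, max_seconds=600):
    """Hang up when EITHER the hard timer fires or the LLM invoked end_call."""
    return elapsed_seconds >= max_seconds or llm_called_end_call

hang = should_hang_up(elapsed_seconds=720, llm_called_end_call=False)
```

The hard timer is the defense against the "keep the bot on the line" abuse discussed upthread, since a caller can talk the LLM out of hanging up but not out of a timeout.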

esafak
3 replies
4d

Congratulations! How do you position yourselves against Google Duplex/Dialogflow and other competitors? https://cloud.google.com/dialogflow

AustinZzx
2 replies
4d

We strive to make conversation humanlike, so maybe less contact center ops development, but more focus on performance and customizability of voice interactions. As a startup, our edge over big tech is being nimble and executing fast.

esafak
1 replies
3d20h

I would keep working on positioning; I feel that your language is woolly at times:

"we focus most of our energy on innovating the AI conversation experience, making it more magical day by day. We pride ourselves on wowing our customers when they experience our product themselves."

This is not useful; you already have testimonials to show what customers think.

Maybe convert that first FAQ point about differentiation into a table comparing you against the closest competitors. Since you talk about performance you should measure it. Use a standard benchmark if there is one for your field.

AustinZzx
0 replies
3d19h

Good point, noted. Benchmarking is a great tool to show differentiation. BTW, apart from what we ourselves think is important (latency, mean opinion score, etc.), would you mind sharing what you'd want to see in such a benchmark? One key metric I like to keep an eye on is the end conversion rate of using the product, but that's very use-case specific.

CuriouslyC
3 replies
3d21h

This is interesting, but having a piece like my speech engine tied to a specific model provider is a non-starter. I'll probably become a customer at some point if you guys make a cheap API for streaming natural voice from LLM text output, if open-source tools don't solve that problem conclusively before then.

AustinZzx
2 replies
3d21h

Could you elaborate a bit on "my speech engine tied to a specific model provider"? Sorry, I might be lacking some context on what you are referring to here.

CuriouslyC
1 replies
3d21h

I will be in the market for a text-to-speech engine, but from the website it seems the model Retell is pushing is "use our all-in-one model + text-to-speech service", which is problematic when my choice of model, and control over how that model runs, is at the core of my product and text-to-speech is a nice-to-have feature. I want an endpoint I can fire text at in streaming mode, where it buffers that streaming text a little and then streams out a beautiful, natural-sounding voice with appropriate emotion and intonation, in a voice of my design. I'm sure I'm not really Retell's ideal customer; they're going after lucrative all-in-one customers who just want to build on top of a batteries-included product.
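The "buffer a little, then stream" behavior described above is often implemented by flushing on sentence boundaries, so the TTS engine gets natural phrases rather than word fragments. A minimal sketch, with `speak` standing in for a real TTS call:

```python
import re

def stream_to_tts(token_stream, speak):
    """Accumulate streamed LLM tokens; flush each complete sentence to TTS."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # flush every sentence that ends in .!? followed by whitespace
        while True:
            m = re.search(r"[.!?]\s+", buffer)
            if not m:
                break
            speak(buffer[:m.end()].strip())
            buffer = buffer[m.end():]
    if buffer.strip():
        speak(buffer.strip())          # flush the tail at end of stream

spoken = []
stream_to_tts(["Hello", " there. ", "How are", " you?"], spoken.append)
```

The trade-off is latency versus prosody: flushing per sentence adds a little delay but lets the TTS engine apply natural intonation across the whole phrase.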

yanyan_evie
0 replies
3d19h

If you are looking for a standalone text-to-speech solution, you could use ElevenLabs' Turbo model.

user_7832
2 replies
4d1h

It's really good, but the AI cracks still show. Trying the demo therapist, I mentioned I'm not finding a job. It suggested finding a career counsellor and said it "would get back as soon as possible"... yeah, no it didn't. It claimed to be "working on it" but would then say "I'm here if you want to speak...". It clearly doesn't understand what it's saying; it feels like Bing's AI would be "better" at not claiming to do a task it can't.

ywj7931
1 replies
4d

Thanks for trying that out! Retell focuses on making the AI sound human; it's the developer's LLM's responsibility to make it think smart. The therapist in the dashboard is for demo purposes only; ideally some developer will plug in their own great AI-therapist LLM to make it more human-like :)

user_7832
0 replies
2d16h

Thanks for the clarification! It appeared that the LLM was part of your product/service but if it can be changed by the devs that's good!

sanchitv
2 replies
2d23h

Excellent product and fantastic demo!

Would it be possible to integrate Retell with a RAG pipeline?

yanyan_evie
1 replies
2d17h

Yes, certainly can. We give the maximum customization for the LLM part.

sanchitv
0 replies
2d13h

Perfect, thanks for the follow-up!

omeze
2 replies
3d20h

The actual conversational flow is awesome. 800ms is only a little worse than internet audio latency (commonly 300-500ms on services like Discord, or even in-game audio in games like Valorant)! Also cool that you can bring your own LLM and audio provider. Awesome product!

yanyan_evie
0 replies
3d20h

Glad you're into the "bring your own LLM" feature—it's tough to fine-tune an LLM, but it's definitely worth it for the improved results.

AustinZzx
0 replies
3d20h

Thank you for the support!

nextworddev
2 replies
4d2h

How does this compare to vocode, another YC company?

yanyan_evie
0 replies
4d2h

If you have your own LLM, our feature is the most customizable. And since we don't own an LLM, we'll focus on making our Voice AI as human-like as possible.

yanyan_evie
0 replies
4d2h

I think Vocode focuses more on open-source libraries; they have tons of integrations. We don't have any integrations; we only focus on the voice AI API part and leave the LLM part to the customer.

monkeydust
2 replies
4d4h

Just tried it; it's impressive but needs work. Trying to book an appointment three weeks out, it acknowledged the request but could not confirm an exact date and time. Still impressed.

yanyan_evie
1 replies
4d4h

Thanks!! see you then

cwbuilds
2 replies
3d17h

Awesome product. We would love to use this for our app, but it wouldn't make sense economically. At $0.10 per minute it would cost significantly more than our existing TTS and STT solution. We've manually added a VAD and will have to add a way of handling interruption. All in all it roughly costs us $0.01 per minute, and we just can't afford a 10x increase in costs.

Guessing you guys have found a use case with higher margins than ours which'll explain the price. Great work. Hope we can afford this one day.

yanyan_evie
1 replies
3d16h

Thanks. We get it—the current pricing doesn't fit everyone's budget. We will try to look into ways to roll out a more affordable option later.

cwbuilds
0 replies
3d4h

Great. Good luck with the launch

blakeburch
2 replies
4d3h

Just tried the dental appointment example. The voice sounds great! But I found two issues worth sharing:

- I told it I wasn't available until next year. We confirmed a date. It said Feb 4th, next year. I asked it when next year was and it gave me the definition. On further prying, it told me the current year was 2022, so next year was 2023. For a scheduling use case, it should be date/availability/time-zone aware.

- At the end, it got into a loop saying "I apologize for the confusion. Let me double check your records...". After staying silent, it said "it looks like we've been disconnected". I said "no, I was waiting for you to check my records". The loop repeated. I eventually asked how long it would take to check my records and it told me "a few minutes" but still went through the "disconnected" message.

yanyan_evie
0 replies
4d3h

Thanks for the great feedback! Absolutely, with a fine-tuned LLM or a better prompt, we can make the responses more reasonable. We'll make a note to update our demo prompt accordingly!
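One common prompt-level fix for the date confusion reported above is injecting the current date into the system prompt so the model can resolve relative dates; a sketch (the prompt wording and function name are invented for illustration):

```python
from datetime import date

def scheduling_system_prompt(today=None, tz="America/Los_Angeles"):
    """Build a system prompt that pins down 'today' for a scheduling agent."""
    today = today or date.today()
    return (f"You are a dental receptionist. Today is "
            f"{today.strftime('%A, %B %d, %Y')} ({tz}). Resolve all relative "
            f"dates ('next Tuesday', 'next year') against this date.")

prompt = scheduling_system_prompt(date(2024, 2, 23))
```

Without this, an LLM falls back on whatever "current year" appeared in its training data, which is exactly the 2022/2023 confusion described in the parent comment.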

thirdusername
0 replies
3d11h

I had a similar experience when it offered to connect me to the office manager so I could remove PII from the system, where it just stated it was taking an action and went idle.

Delumine
2 replies
4d4h

Wonder what the justifications for the different voice prices are...

yanyan_evie
1 replies
4d3h

The different providers have different prices. OpenAI TTS and Deepgram are cheaper; ElevenLabs is higher.

echelon
0 replies
4d1h

You'll be able to build your own high quality, low latency voices at scale.

yanyan_evie
0 replies
4d3h

yes… one of our favorite movies

tin7in
1 replies
3d21h

Very cool! Have you looked into call center agents use cases?

AustinZzx
0 replies
3d21h

Yes, we are a developer tool, and we certainly get interest from clients working on customer satisfaction agents, call center agent training, etc.

thimkerbell
1 replies
3d22h

Is there societal value that this product is harming?

AustinZzx
0 replies
3d19h

We are dedicated to preventing that from happening. Spam calling and identity theft are key areas we will build guardrails around. Feel free to let us know if you think of any other case.

tayloramurphy
1 replies
3d19h

One use case that I'd be interested in for this is training it to use my voice a la Descript. It's been really nice that our Head of Marketing can iterate on video voiceovers using my voice and all I had to do was read 30 seconds of copy.

Any plans for something like that? It'd be really interesting to have prospects chat with an AI version of me to answer some basic questions.

AustinZzx
0 replies
3d19h

Voice cloning is definitely on our roadmap.

sidcool
1 replies
4d

Congrats on launching. It feels very natural and the demo call was good.

AustinZzx
0 replies
4d

Thanks for the support. Means a lot to us.

AustinZzx
0 replies
3d21h

We are actively working on that. Thanks for the support.

overstay8930
1 replies
3d15h

This is super cool, congrats on the launch!

AustinZzx
0 replies
3d11h

Glad you like it!

nprateem
1 replies
3d20h

I'm always curious for things like this where people get training data.

AustinZzx
0 replies
3d19h

If you are referring to the LLM used in the demo, it's a simple GPT. If you are referring to audio data, there are some (not a lot of) public datasets, though be careful about each dataset's license. To get more data, you might build a studio to collect recordings from contracted voice actors, or purchase from other sources.

nostrebored
1 replies
3d17h

Commercially, I struggle to see how this fits in to the most natural application, a contact center.

For example, take Amazon Connect pricing:

There is an Amazon Connect service usage charge, based on end-customer call duration. At $0.018 per minute * 7 minutes = $0.126

There is an inbound call per minute charge for German DID numbers. At $0.0040 per minute * 7 minutes = $0.0280

If I were to bring Retell into the loop, I'm changing my self-service per-minute cost from $0.018 + $0.004 = $0.022 per minute to $0.122 per minute at the cheapest setting. And from that, I don't have a clear case for the impact on deflection, and because of that I don't have clear ROI.

This isn't saying that the product isn't great -- but at the current price I struggle to see how anyone at scale can use it without eating the double whammy of LLM + Retell costs.
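To make the comparison reusable, here's a quick back-of-the-envelope calculator. The Connect and DID rates are the ones quoted above; the Retell per-minute rate is a placeholder assumption, not their published price:

```python
# Back-of-the-envelope per-call cost comparison for a 7-minute call.
# Connect/DID rates are from the comment above; the Retell rate is an
# assumed placeholder for the cheapest setting.
CONNECT_PER_MIN = 0.018   # Amazon Connect service usage
DID_PER_MIN = 0.004       # inbound German DID number
RETELL_PER_MIN = 0.10     # assumed Retell per-minute rate

def call_cost(minutes, *rates):
    """Total cost of a call at the given per-minute rates."""
    return round(minutes * sum(rates), 4)

baseline = call_cost(7, CONNECT_PER_MIN, DID_PER_MIN)
with_retell = call_cost(7, CONNECT_PER_MIN, DID_PER_MIN, RETELL_PER_MIN)
print(baseline, with_retell)  # 0.154 vs 0.854
```

Swapping in your own rates and call durations makes it easy to see where the deflection rate would need to land for the economics to work.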

yanyan_evie
0 replies
3d16h

Thanks for the heads-up. We're definitely aiming to cover customer service down the line. We'll be exploring ways to introduce a more affordable option soon.

lawrencechen
1 replies
3d19h

Could you add Partykit/Cloudflare Durable Objects integration docs?

AustinZzx
0 replies
3d18h

We will keep that in mind. In the meantime, if you have a working version, we are happy to feature your integration repo in our doc.

lalala6_89
1 replies
3d23h

If I record screen and audio when playing video games with my friends, will I be able to fine-tune an LLM+audio model on that dataset?

It'll be like that San Junipero episode of Black Mirror - immortality in a dark way.

AustinZzx
0 replies
3d21h

You certainly could, given you play video games for long enough to gather the needed data lol.

jsumrall
1 replies
3d10h

First thought is can’t you set up calls with prospects and have the product _literally_ sell itself?

yanyan_evie
0 replies
3d9h

Great idea. However, it seems like outbound sales calls could be annoying, and the FCC appears to restrict them.

intalentive
1 replies
3d20h

In the "Select Voice" dialog, all the DeepGram clips end in a loud click. Might want to fix that.

yanyan_evie
0 replies
3d20h

Roger that! Will fix it.

djyaz1200
1 replies
4d2h

I tried the demo, and it got confused and disconnected, but it's a cool proof of concept. Suggest bumping up the happiness emotion on the agent, and a Calendly integration would immediately unlock a lot of use cases. Good luck!

yanyan_evie
0 replies
4d2h

Thanks for the suggestion! Will take a look into the confusion problem

bricee98
1 replies
4d1h

The demo was incredible, and this seems perfect for my current project. I am going to try to integrate this as soon as I can to see if it works for me. How responsive can I expect the support to be?

yanyan_evie
0 replies
4d1h

We pride ourselves on being very responsive! We usually create a Slack group with users actively integrating and answer any questions ASAP

blindgeek
1 replies
4d3h

My friend group and I have been playing with LLMs: https://news.ycombinator.com/item?id=39208451. We tend to hang out in multi-user voice chat sometimes, and I've speculated that it would be interesting to hook an LLM to ASR and TTS and bring it into our voice chat. Yeah, naive, and to be honest, I'm not even sure where to start. Have you tried bringing your conversational LLM into a multi-person conversation?

yanyan_evie
0 replies
4d3h

It's a great idea. We have a use case where a customer wants to add a voice agent into Zoom. We could schedule a call to talk about the tech design.

aik
1 replies
3d23h

Curious what model the dentist bot is running on? Tried it out, was surprisingly good, though eventually it contradicted itself (booked a slot it said previously was not available). (I get that’s the programming but am curious especially given the latency is really great).

AustinZzx
0 replies
3d21h

The demo runs on plain GPT-3.5 Turbo.

_fw
1 replies
4d1h

This is absolutely wild - I got chills when I thought about the fact I’m talking to a computer. Congratulations on flying straight over uncanny valley.

AustinZzx
0 replies
4d

Thanks for the support, we still have a lot of work ahead of us to make it better!

Spiwux
1 replies
3d8h

I'm assuming you're targeting this mainly at enterprises and business use-cases such as callcenters, but are you planning to make this usable for personal use cases as well? For example, having a bot to bounce ideas off while coding. Pretty much "just" the TTS / STT layer to talk to my finetuned LLM in a natural manner while you handle interruptions and such.

I think the main issue right now for personal use would be cost (and I'm guessing STT / TTS are the most expensive parts..)

moffkalast
0 replies
3d7h

A rubber duck that talks back? Now AI has truly gone too far dammit!

Jommi
1 replies
3d21h

Sad that people think working on crypto is a waste of time yet here we are making antiquated contact methodologies even harder to prune out.

AustinZzx
0 replies
3d11h

Hey, think about the bright side of voice AI -- there are new opportunities like AI companions and AI assistants, not only the antiquated contact methodologies.

Gulipad
1 replies
3d23h

Wow, this is sweet! With a little better latency and less perfection, it'd be well over the uncanny valley (not that it wouldn't fool many people as-is). Are you planning to add more "human" elements like filler words or disfluencies? If anything it feels too perfect to be human. Awesome stuff!

P.S: I tried to fool the Dental Office demo trying to book on Sunday or outside of the slots it had indicated, and it did a better job than many humans would have :)

yanyan_evie
0 replies
3d22h

Yes, we do plan to make the responses more conversational by adding pauses, filler words, slight stuttering, etc. This is also a high priority for us to work on.

waruka
0 replies
3d2h

How is this different from vocode.dev?

tintor
0 replies
3d1h

backchanneling phrases are annoying in human conversation

mdfk
0 replies
3d7h

What languages are you supporting?

mattbarrie
0 replies
2d11h

It's pretty amazing but can get a bit lost/into a bugged state.

iraldir
0 replies
3d9h

- Did not work on Firefox for me (I'd start a conversation and nothing would happen; I would not hear the voice).
- On Chrome it would not let me change my microphone, so I had to open my MacBook.
- Also on Firefox, when I first logged in, clicking the "try now" button sent me back to the landing page; I had to go and click on the playground.

With that out of the way, it's really interesting. The challenge, I suppose, is how narrow or wide the API made by the developer is. A narrow case, like "booking a dentist appointment", might feel like a step down compared to an online form, and would most likely fall short of satisfying someone calling, because if they're calling they have some deeper need.

On the other hand, a wide API (say, access to the dental practice's pricing information, info about the doctors, a way to reach an actual human being, health advice and post-op advice, generating documents from past appointments, etc.) brings a higher chance of hallucination, misclassification of what the user wants, and so on.

I'm still not sure if the best approach is just to leave the AI to deal with all of that, or to represent the user intent with some sort of state machine, maybe case by case. But the more you add to that logic, the slower the machine gets.
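To illustrate the state-machine option, here's a purely hypothetical sketch; the state names and transitions are invented for a booking-style flow:

```python
# Minimal dialog state machine: the LLM classifies the caller's intent,
# but allowed transitions are constrained by explicit states, which
# limits hallucinated jumps (e.g. "booking done" before confirming a slot).
TRANSITIONS = {
    "greeting":     {"book", "ask_pricing", "human_handoff"},
    "book":         {"confirm_slot", "human_handoff"},
    "confirm_slot": {"done", "book"},
    "ask_pricing":  {"book", "human_handoff", "done"},
}

class DialogFSM:
    def __init__(self, start="greeting"):
        self.state = start

    def step(self, intent):
        """Advance only if the classified intent is a legal transition."""
        if intent in TRANSITIONS.get(self.state, set()):
            self.state = intent
            return True
        return False  # illegal jump: fall back to a clarifying reprompt

fsm = DialogFSM()
fsm.step("book")          # True: greeting -> book
fsm.step("done")          # False: can't finish before confirming a slot
fsm.step("confirm_slot")  # True
```

The tradeoff is exactly the one above: every intent classification is an extra model call or rule check, so the richer the graph, the more latency you pay per turn.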

goatgurux
0 replies
2d12h

Is it possible to train it / add knowledge with a PDF? Thanks in advance, it's all magic.

exizt88
0 replies
3d7h

Unfortunately, based on my experience, GPT-3.5 is simply not up to snuff for phone conversations in terms of style and reasoning, and GPT-4 Turbo latency is far too high for real-time conversation with a straightforward STT → text-gen → TTS pipeline.
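The stacking effect is easy to model; here's a toy sketch with invented per-stage timings (not measurements of any real provider):

```python
# Toy model of a sequential STT -> text-gen -> TTS turn: in a naive
# pipeline, end-to-end latency is the sum of the stages, which is why a
# slow LLM dominates the turn-taking budget.
STAGES_MS = {
    "stt_final_transcript": 300,   # placeholder numbers
    "llm_first_token": 800,        # e.g. a large model's time-to-first-token
    "tts_first_audio": 250,
}

def turn_latency_ms(stages):
    """Naive pipeline: stages run strictly one after another."""
    return sum(stages.values())

print(turn_latency_ms(STAGES_MS))  # 1350
```

Real low-latency systems avoid the strict sum by streaming: TTS starts on the first LLM tokens and STT emits partial transcripts, so the stages overlap rather than queue.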

davidmacias
0 replies
3d7h

This was impressive. Have you integrated this with something like Amazon Connect or Twilio? Ultimately I'm looking at how I could bring it to my customers via whatever they already use for their call centers.

benkaiser
0 replies
2d4h

This really makes me wonder if we can go full circle at fast-food restaurants like McDonalds. Ordering at a counter (10 seconds) has been replaced in recent years with touching a giant screen (2+ minutes).

This demo really shines a light on how you could "talk" to place your order in a comparatively fast amount of time, similar to speaking with a human. Yes, it can't handle all the million corner cases, and it will get things wrong some of the time, but at least you can verify your order on a screen before tapping your phone to pay.