[stub for offtopicness]
Wow, this is incredible. I've worked a bit in the conversational LLM space, and one of the hardest problems we were struggling with was human interruption handling. From the demo it seems like you guys have it down. Can't wait to see where this goes :) BTW I don't think the demo works on mobile; I tried it on Safari on iOS and got no response.
It might ask for permission to use the microphone. If you can't find it, try going to the website's homepage, where you can enter your phone number to receive a call.
https://github.com/vocodedev/vocode-python
If you're looking for the phone flavour, this is also very good. Curious which is 'better'.
Retell is much stronger at handling human interruptions
It seems to be broken on iOS Safari. I got no response after accepting the microphone prompt.
Thanks for the feedback. We will look into it
Yep, I gave microphone permission on both my Mac and my phone, but I only got to try the demo out on the Mac anyway.
They most likely have two "agents" working in tandem to listen and speak, and it seems like the listener takes precedence over the speaker agent, but underneath they share the same context window. Programming-wise, they're probably using a multithreading-and-channels architecture, depending on the programming language.
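Roughly, the speculation in Python terms: a minimal sketch (all names here are invented) where asyncio queues and events play the role of the channels, the listener preempts the speaker, and both share one context:

```python
import asyncio

context: list[str] = []  # shared "context window" for both agents

async def generate_reply(history: list[str]) -> list[str]:
    # Stub standing in for an LLM call; returns speakable chunks.
    return ["Sure,", "let me", "check that for you."]

async def play_audio(chunk: str) -> None:
    await asyncio.sleep(0.2)  # stub standing in for TTS playback

async def listener(heard: asyncio.Queue, interrupt: asyncio.Event) -> None:
    while True:
        utterance = await heard.get()   # fed by streaming STT in practice
        interrupt.set()                 # listener takes precedence
        context.append(f"user: {utterance}")

async def speaker(interrupt: asyncio.Event) -> None:
    while True:
        chunks = await generate_reply(context)
        for chunk in chunks:
            if interrupt.is_set():      # caller barged in: stop mid-utterance
                interrupt.clear()
                break
            await play_audio(chunk)
        context.append("agent: " + " ".join(chunks))
        await asyncio.sleep(1.0)        # idle before speaking again

async def main() -> None:
    heard: asyncio.Queue = asyncio.Queue()
    interrupt = asyncio.Event()
    asyncio.create_task(listener(heard, interrupt))
    asyncio.create_task(speaker(interrupt))
    await heard.put("Actually, wait, one more question")  # simulated barge-in
    await asyncio.sleep(2)

asyncio.run(main())
```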
What a cool trick to pull that off!
I would feel deceived if I were a customer of any company or office that uses this. If I take the trouble to call by phone, it's because I want to speak with a person. If I wanted to talk to a machine, I would send an email, talk to a chatbot, or even try to communicate with the company through social media. Calling by phone implies that I am investing time and effort, and I expect the same from the other side.
Totally understandable that most people would want to chat with a human agent (I sometimes share the same feeling). However, I do think a major reason for that is that voice bots were bad before: they could not understand and get things done, and felt like a waste of time. With advancements in voice AI and LLMs, I'm confident there will be more use cases where talking to a voice bot is not a bad experience.
No. LLMs are worse for customer experience than their predecessors: LLMs confabulate, and their language is so smooth that you often need expertise to catch them in it.
People call customer service because they don’t know what to do. It would be better for most customers to talk to a bot that they can catch making a mistake.
Recent example: https://bc.ctvnews.ca/air-canada-s-chatbot-gave-a-b-c-man-th...
Yes, I agree there are problems with LLMs (hallucinations, persona, etc.), and that's exciting because it means room for improvement and opportunity. I know many people who are working hard in that field trying to make LLMs converse better.
For example:
- "hallucinations / LLMs confabulate": techniques like RAG can help (see the toy sketch below)
- "language is so smooth that you often need expertise to catch them in it": fine-tuning and prompt engineering can help
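For illustration only, a toy sketch of the RAG bullet above (not any particular product's API): retrieve grounding text first, then constrain the LLM to it:

```python
import re

documents = [
    "The office is open Mon-Fri, 9am-5pm.",
    "We accept Delta Dental and MetLife insurance.",
]

def tokens(s: str) -> set[str]:
    return set(re.findall(r"[a-z]+", s.lower()))

def retrieve(question: str) -> str:
    # Toy keyword retriever; real systems use embeddings + a vector index.
    q = tokens(question)
    return max(documents, key=lambda d: len(q & tokens(d)))

def build_prompt(question: str) -> str:
    # Constraining the model to retrieved context is what curbs confabulation.
    return (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, say you don't know.\n"
        f"Context: {retrieve(question)}\nQuestion: {question}"
    )

print(build_prompt("Do you take MetLife?"))  # send this to your LLM of choice
```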
Personally speaking, when I called the DMV and was asked to wait for 40 minutes, if an AI can help me solve that problem, I wouldn't mind. But I definitely understand that different people have different expectations.
I much rather talk to an AI bot than waiting on the line for a human for 50 minutes.
It's a good point, and one the bot industry has not really figured out; forget voice bots, I'm talking about those annoying ones telecom companies throw up. My immediate reaction when I get a bot is to throw in a bunch of garbage to get routed to a human as fast as possible. When they get better, perhaps I might change my behaviour.
Personally, I think if the bot can get things done, then I won't mind. I just hope these bots don't keep repeating the same things without getting anything done.
Completely disagree. You’re not making a phone call in most cases for entertainment purposes. If the options are wait in line for 20 minutes or speak to an actually useful bot, I would take the latter in 100% of cases.
This is incredible and terrifying at the same time. Does it support long context? As in, can I voice chat with an instance of an agent, and then later in a different chat refer to items discussed in the previous chat? Can I also type / text with the agent and have it recall items from a previous session?
That's an interesting point! We did consider adding memory to the voice agent; we have use cases like an AI therapy session wanting to recall former conversations with the patient. Adding the previous chat would be very helpful as well.
an AI therapy session
oh no
The use case I recall involves a nonprofit organization focused on suicide prevention. They are hoping for an AI therapy solution capable of listening to patients and picking up the phone when no human is available. This isn't entirely unacceptable, because one of a therapist's roles is to listen to problems, so AI can effectively substitute in this aspect.
You're not wrong, and I agree this is a great use case, but consider calling it crisis response vs a therapist. A therapist is there to help you dig deep, over a long time, crisis response is a tactical mechanism to prevent imminent self harm.
Amazing product, looking forward to working with it.
If it were only to gap-fill, then that sounds reasonable, but another risk here is that the voice agent picks up the slack and lowers the pressure for staffing and for solving these problems in the first place.
Which is worse: that no one is available to listen to you when you're suicidal, or that you have so little value that only a machine will talk to you? I'm sure some people would have an extremely poor reaction to that.
Is there support for different languages too?
It's definitely on our roadmap. After the core product (the voice AI part) becomes humanlike enough, we will support multilingual capabilities.
Spanish would be very helpful
Yes, if you don't mind, you could leave your email on the waiting list at the footer of the website. We could keep you posted!
I second this request. For a lot of applications (say, a dental office with English-speaking employees and predominantly English-speaking customers), native-language support might seem like a "nice to have", but compare a stilted conversation with a non-native speaker (broken English to English) against a slightly incorrect conversation in their native language: there might be more tolerance for some of the hiccups that AI has. That is, I might get more information from a poor conversation in a person's native language than from us trying to communicate in their poor English.
Yes. Great point
What's the difference between your product and Gridspace's? I get the sense that your offering is more developer-focused, but I'm curious if there are any technical differences.
I believe Gridspace is an IVR solution, not one based on LLMs. It's challenging to ask questions that deviate from the initial settings. We're using LLMs to generate responses, which makes the conversation smoother.
Gridspace CEO here. Our virtual agents use many specialized LLMs. We run these models on our own soft switches at pretty big scale. We also build the software that lets contact center operators onboard and manage LLM-based, virtual agents. A lot is at stake with virtual agents in the contact center. Contact centers need virtual agents that sound great, get the facts right and complete service tasks. Some of our customers are very technical, some are not, all deeply care about how AI represents their businesses.
Gridspace’s lecture series on building / extending LLM’s for MIT IAP: https://youtube.com/playlist?list=PL6owWFYBB-AqnqbfPOudH3xGu...
Holy hell Gridspace is good
With respect to Alignment, it should be a fundamental requirement that a speech AI is ___REQUIRED___ to honestly inform a Human that they're speaking to an AI.
Can you please ensure, going forward, that you have the universal "truth", as it were, of having your system always identify itself as AI when "prompted" (irrespective of what the app/dev has built - your API should ensure that if the "safeword" is used, it shall reveal that it is AI).
--
"trust me, if you ask them if they are a cop, they legally have to tell you they are a cop" (court rules its legal for cops to lie to you) etc....
(it should be like those tones on a hold-call, to remind you that youre still on hold... but instead its a constant reminder that this bitch is AI) -- there should be some Root-level escape word to require any use of this tool to contact a Human. That word used to be "operator" MANY times, but still...
Maybe if a conversation with an elderly Human goes on with too many "huh? I cant hear you" or "i dont understand, can you repeat that" questions, your AI knows its talking to a non-tech Human, and it should re-MIND the Human that youre just an AI. (meaning no sympathy, emotion, it will not stop until you are dead) etc...
Guardrails, motherfucker, Do you speak it!"
Good point. Currently, our product does not contain an LLM, as we are purely a voice API -- the developer brings their own LLM solution and gets to decide what to say. This would be a great guardrail to build in for all sorts of reasons; we'll see how we can suggest our users adopt it.
May I please understand your arch:
a dev builds an app, pipes it to your API, and you spit it back out? - if so - ensure that whatever you spit out identifies itself to whomever is listening....
--
Plz explain the arch of how your system works? (or link me if I missed..)
----
Shortest and most important law ever written:
"an AI must identify itself as AI when asked by Humans."
0. Law of robotics.
------
@autsin
-
Cool - so I'm on an important call with [your customer] and your system has an outage?
How is this handled? A dropped call?
(I am not being cynical - I'm being someone who is allergic to post-mortems.)
----
EDIT: you need to stop using the term "user" in anything you market or describe. Full stop.
The reason: in the case of your product, the USER is the motherhecker on the phone listening to whatever your CUSTOMER is spewing at them VIA your API.
The USER is who is making IRL *>>>DECISIONS<<<* based on what they hear from your system.
Your CUSTOMER is from whom you receive money.
THEIR customer is from whom they get the money to pay you.
The USER is the end-point Human, who doesn't even know you exist.
We handle the audio bytes in and out, and also connect to our user's server for responses. We handle the interaction and decide when to listen and when to talk, and send live updates to our users. When a response is needed, we ask for it and get it from our user.
Our homepage https://www.retellai.com/ has a GIF on it that illustrates this point.
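To illustrate only (the message shapes below are invented for this sketch, not our actual wire format; see the docs for the real protocol), the response server on the user's side could look roughly like this:

```python
import asyncio
import json

import websockets  # pip install websockets

def my_llm(history: list[str]) -> str:
    return "Sure, I can help with that."  # stand-in for your own LLM

async def handle_call(ws) -> None:
    history: list[str] = []
    async for raw in ws:
        msg = json.loads(raw)
        if msg["type"] == "transcript_update":
            history.append(msg["text"])  # live updates from the voice layer
        elif msg["type"] == "response_needed":
            # The voice layer decided it's our turn to talk; it asks us what to say.
            await ws.send(json.dumps({"type": "response",
                                      "text": my_llm(history)}))

async def main() -> None:
    async with websockets.serve(handle_call, "0.0.0.0", 8080):
        await asyncio.Future()  # serve forever

asyncio.run(main())
```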
Nice catch on the wording -- customer is indeed more accurate than user.
For outage handling: we strive to keep 99.9%+ uptime. In the case of a dropped call, the agent would hang up if using the phone, and might have different error handling on the web depending on how the customer handles it.
I was skeptical but the demo is incredible (https://beta.retellai.com/home-agent)
Wow, I agree! That was beyond expectations. The only let-down was the AI contradicted itself when I tried layering on conditionals. It was something like this:
"What time works"
"Morning on tuesday would be best, but I can also do afternoon"
"I'm sorry, I didn't catch what time in the afternoon you wanted"
"No, I said the morning"
"I'm having a hard time hearing you. What time in the morning did you want?"
"10am"
And from there things were fine. It seemed very rigid on picking a time and didn't suggest times when I laid out a range.
Great point! There's some room for the prompt to improve~
glad you like the demo!
Incredible, even has a bit of "human-like" passive-aggression when I was asking dumb questions:
[Me] what kind of dental equipment do you use?
[AI] (sigh) we use a variety of reputable brands for our dental equipment, was there anything specific you'd like to know?
It almost sounded like she was rolling her eyes at that question, like I was wasting her time haha

The demo is nice but it makes me wonder: why would a company have a fully automated voice line rather than a booking interface? As a customer I'm never happy to call a company to make a reservation. I'd be extra annoyed if an AI picked up and I had to go through the motions of a conversation instead of doing two clicks in a Web UI.
Yes, for booking appointments, a simple interface might do the trick. However, we've seen many excellent use cases of our API that prevent repetitive tasks and help companies save money, like AI logistics assistants, pre-surgery data collection, AI tutors, and AI therapists. I believe the future will bring even more voice-interface applications. Imagine not having to navigate complex UIs; you could easily book a flight or a hotel just by speaking. Also, older people might prefer phone calls over navigating UI interfaces.
Booking a flight with a voice interface is 10 steps backwards from web UIs where I can see different options, prices, calendars....
As a user, I want to talk to LLMs to answer my random day to day questions. I tried the Retell demo and asked it questions that I had earlier asked Alexa and Google Assistant. The Retell experience was a million times better. I wish I could set it as my phone's (Android) default assistant. If you built that, I would pay good money for it!
However, I agree with others. What I don't want to do is talk to LLMs for customer support. For that, I always want to talk to a real person. It would be infuriating for an LLM to pick up when calling a business.
I totally get that clicking in Web UI is super convenient in many scenarios, and I think GUI and voice can co-exist and create synergy. Suppose AI voice agent can solve your problem, cater to your needs, and interact like a human. In that case, I believe it would be super helpful in many scenarios (like others mentioned, waiting on line for 40 minutes is a pretty bad experience). There are also new opportunities in voice like AI companion, AI assistant, etc that we see are starting to emerge.
Can you buy/remove phone numbers programmatically? I can't find anything in the docs.
We have open-source demo repos. Node.js demo: https://github.com/adam-team/retell-backend-node-demo
Python demo: https://github.com/adam-team/python-backend-demo
thanks!
The doc is here: https://docs.retellai.com/guide/phone-setup
thanks!
Until this demo, the most impressive conversational experiences I'd seen were Pi and LiveKit's KITT demo (https://livekit.io/kitt). I don't think KITT was quite as fast in response time (as Retell), but it's incredibly impressive for being fully open source and open to any choice of APIs (imagine KITT with the Groq API + Deepgram's Aura for super low latency).
Retell focusing on all of the other weird/unpredictable aspects of human conversation sounds super interesting and the demo's incredible.
Things are moving so fast, wow.
We recently made it a lot easier to build your own KITT too: https://github.com/livekit/agents
But we don't handle interruptions yet, that's some cool stuff @yanyan_evie!
There are a lot of good open-source VAD models that are easily configurable and can be integrated in a day or two - check out the Silero VAD model.
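For reference, basic Silero VAD usage looks roughly like this (from memory of their quickstart; check the repo for the current API):

```python
import torch

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad')
(get_speech_timestamps, save_audio, read_audio,
 VADIterator, collect_chunks) = utils

wav = read_audio('caller.wav', sampling_rate=16000)
# Speech segments in samples; useful for endpoint detection, or for cutting
# TTS playback the moment the caller barges in.
print(get_speech_timestamps(wav, model, sampling_rate=16000))
```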
Thanks! Will push harder
Can you share what you're doing for TTS? Is it a proprietary fully pretrained in-house model, a fine tuned open source one, or a commercially licensed one?
For TTS, we currently integrate with different providers like ElevenLabs, OpenAI TTS, etc. We do have plans down the road to train our own TTS model.
Ah thank you! What's the lowest latency option you have found so far?
Deepgram TTS is pretty fast, but they have not publicly launched yet.
ty!
I forgot that we're playing the demo games now.
Well I'll see your demo, and raise you another demo
https://x.com/sean_moriarity/status/1760435005119934862
Except that demo was done by a single person, with a day job and no cap table.
Love to see people take an interest in our demo. But I checked his earlier posts, and he had already built a voice agent a couple of days ago, so I believe he just changed the prompts to create this demo: https://twitter.com/sean_moriarity/status/175895035375034804.... Also, after testing their live demo at https://nero.ngrok.dev/, it appears to still be missing some key pieces, such as low latency and endpoint detection.
As a bystander, I definitely agree that Retell's demo wins on latency — it felt pretty close to human. The voice also sounds more natural to my ears (not sure why, since Nero is apparently using Eleven Labs for voices, and Eleven's voices usually sound pretty good to me).
Playing around with the Nero demo, it also feels pretty... broken? I never actually managed to make a booking with Nero; it seemed to often not be able to tell I was done speaking, and would just hang after a couple back-and-forths. Still impressive for a one-man demo built quickly, but they don't feel in the same league.
I do wish Retell's pricing was cheaper, though; $6/hr is pretty much the cost of a call center employee in India, and LLMs still perform below the average human on most things. That being said, I imagine the cost of Retell will come down as the tech advances, and LLMs will also improve over time.
Yeah, the latency in my demo is definitely worse. There also seem to be some issues where it picks up its own audio and keyboard/mouse clicks and tries to transcribe them, which leads to derailed convos.
The unnatural voices happen I think because for some reason the way I stream audio to the browser breaks up the audio from ElevenLabs, so you get almost this “start/stop” sound that makes it sound worse than usual.
There’s a lot of minor details to get right, but it’s a fun problem to try to solve. I wish Retell the best :)
This is very interesting. One thing I wondered about the per-minute pricing is how to keep a phone agent like this from being kept on the phone in order to run up a bill for a company using it. It'd be very inexpensive to make many automated calls to an AI bot like the dentist receptionist in the demo, and to just play a recording of someone asking questions designed to keep the bot on the phone.
As a customer of a service like Retell (though of course not specific to Retell itself), how might one go about setting up rules to keep a phone conversation from going on for too long? At 17¢ per minute, a 6-minute call will cost just over $1, or about $10 per hour. Assuming the AI receptionist can take calls outside of business hours (which would be a nice thing for a business to offer), then such a malicious/time-wasting caller could start at closing time (5pm) and continue nonstop until opening time the next day (8am), with that 15 hour span costing the business $150 for billable AI time. If the receptionist is available on weekends (from Friday at 5pm until Monday at 8am), that's a 63-hour stretch of time, or $630. And if the phone system can handle 10 calls in parallel, the dentist could come in Monday morning to an AI receptionist bill of over $6,300 for a single weekend (63 hours × $10 per hour × 10 lines).
This is in no way a reflection on Retell (I think the service is compelling and the usage-based pricing is fair, and with that being the only cost, it's approachable and easy for people to try out). The problem of when to end a call is one I hadn't considered until now. Of course you could waste the time of a human receptionist who is being paid an hourly wage by the business, but that receptionist is going to hang up on you when it becomes clear you're just wasting their time. But an AI bot may not know when to hang up, or may be prevented from doing so by its programming if the human (or recording) on the other end of the line is asking it not to hang up. You could say it shouldn't ever take more than five minutes to book a dentist appointment, but what if the person has questions about the available dental procedures, or what if it's a person who can't hear well or a non-native speaker who has trouble understanding and needs the receptionist to repeat certain things? A human can handle that easily, but it seems difficult to program limits like this in a phone system.
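For concreteness, the arithmetic behind that worst case (the $10/hour above rounds $0.17 x 60 = $10.20 down):

```python
rate_per_min = 0.17
overnight = 15 * 60 * rate_per_min       # 5pm-8am, one line
weekend = 63 * 60 * rate_per_min         # Fri 5pm - Mon 8am, one line
print(f"${overnight:.2f}, ${weekend:.2f}, ${weekend * 10:.2f}")
# $153.00, $642.60, $6426.00 (ten parallel lines for a weekend)
```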
What stops it for regular human-operated phone lines?
humans can hang up
This can be handled with function calling and other LLM features. We support an input signal for closing the call, and you can have your rule-based (timer) system or LLM-based end-call functionality use that to hang up.
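A rough sketch of the LLM side, assuming an OpenAI-style chat API; `end_call` here is a hypothetical tool name that you would wire to the voice layer's hang-up signal:

```python
from openai import OpenAI

client = OpenAI()

end_call_tool = {
    "type": "function",
    "function": {
        "name": "end_call",  # hypothetical: mapped to the hang-up signal
        "description": "Hang up once the caller's request is resolved "
                       "or the caller says goodbye.",
        "parameters": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a dental receptionist."},
        {"role": "user", "content": "That's all I needed, bye!"},
    ],
    tools=[end_call_tool],
)

for call in response.choices[0].message.tool_calls or []:
    if call.function.name == "end_call":
        pass  # signal the voice layer to hang up here
```

A rule-based timer can sit alongside this as a hard cap, so a runaway conversation ends even if the model never calls the tool.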
Congratulations! How do you position yourselves against Google Duplex/Dialogflow and other competitors? https://cloud.google.com/dialogflow
We strive to make conversation humanlike, so maybe less contact center ops development, but more focus on performance and customizability of voice interactions. As a startup, our edge over big tech is being nimble and executing fast.
I would keep working on positioning; I feel that your language is woolly at times:
we focus most of our energy on innovating the AI conversation experience, making it more magical day by day. We pride ourselves on wowing our customers when they experience our product themselves.
This is not useful; you already have testimonials to show what customers think.
Maybe convert that first FAQ point about differentiation into a table comparing you against the closest competitors. Since you talk about performance you should measure it. Use a standard benchmark if there is one for your field.
Good point, note taken. Benchmarking is a great tool to show differentiation. BTW, apart from what we think is important ourselves (latency, mean opinion score, etc.), would you mind sharing what you'd want to see in such a benchmark? One key metric I like to keep an eye on is the end conversion rate of using the product, but that's very use-case specific.
This is interesting, but having a piece like my speech engine tied to a specific model provider is a non-starter. I'll probably become a customer at some point if you guys just make a cheap API for streaming natural voice from LLM text output, if open-source tools don't solve that problem conclusively before then.
Could you elaborate a bit on "my speech engine tied to a specific model provider"? Sorry, I might be lacking some context on what you are referring to here.
I will be in the market for a text-to-speech engine, but from looking at the website, it seems the model Retell is trying to push is "use our all-in-one model + text-to-speech service", which is problematic when my choice of model and control over how that model runs is at the core of my product, and text-to-speech is a "nice to have" feature. I want an endpoint that I can fire text off to in streaming mode, where it'll buffer that streaming text a little and then stream out a beautiful, natural-sounding voice with appropriate emotion and intonation in a voice of my design. I'm sure I'm not really Retell's ideal customer; they're going after lucrative "all in one" customers that just want to build on top of a batteries-included product.
If you are looking for a text-to-speech solution, you could use the ElevenLabs Turbo model.
It's really good, but the AI cracks still show up. Trying the demo therapist, I mentioned I'm not finding a job. It suggested finding a career counsellor and said it "would get back as soon as possible"... yeah, no it didn't. It claimed to be "working on it" but would say "I'm here if you want to speak...". It clearly doesn't understand what it's saying; it feels like Bing's AI would be "better" at not claiming to do a task it can't.
Thanks for trying that out! Retell focuses on making the AI sound like a human; it's the developer's LLM's responsibility to make it think smart. The therapist in the dashboard is for demo purposes only, and ideally some developer will plug in their great AI-therapist LLM to make it more humanlike :)
Thanks for the clarification! It appeared that the LLM was part of your product/service but if it can be changed by the devs that's good!
Excellent product and fantastic demo!
Would it be possible to integrate Retell with a RAG pipeline?
Yes, you certainly can. We give maximum customization for the LLM part.
Perfect, thanks for the follow-up!
the actual conversational flow is awesome. 800ms is only a little worse than internet audio latency (commonly 300-500ms on services like Discord or even in-game audio for things like Valorant)! Also cool that you can bring your own LLM and audio provider. awesome product!
Glad you're into the "bring your own LLM" feature—it's tough to fine-tune an LLM, but it's definitely worth it for the improved results.
Thank you for the support!
How does this compare to vocode, another YC company?
If you have your own LLM, our feature is the most customizable. And since we don't own an LLM, we'll focus on making our Voice AI as human-like as possible.
I think vocode will focus more on open source libraries, they have tons of integrations. We don’t have any integrations, we only focus on the voice AI API part and leave the LLM part to customer.
Just tried it; it's impressive but needs work - trying to book an appointment three weeks out, it acknowledged that but could not confirm an exact date and time. Still impressed.
Thanks!! see you then
If it does not work, try this link: https://calendly.com/retell-ai/retell-ai-user?month=2024-02
Awesome product. We would love to use this for our app, but it wouldn't make sense economically. At $0.10 per minute it would cost significantly more than our existing TTS and STT solution. We've manually added a VAD and will have to add a way of handling interruptions. All in all, it costs us roughly $0.01 per minute, and we just can't afford a 10x increase in costs.
Guessing you guys have found a use case with higher margins than ours which'll explain the price. Great work. Hope we can afford this one day.
Thanks. We get it—the current pricing doesn't fit everyone's budget. We will try to look into ways to roll out a more affordable option later.
Great. Good luck with the launch
Just tried the dental appointment example. The voice sounds great! But I found two issues worth sharing:
- I told it I wasn't available until next year. We confirmed a date. It said Feb 4th, next year. I asked it when next year was and it gave me the definition. On further prying, it told me the current year was 2022, so next year was 2023. For a scheduling use case, it should be date/availability/time zone aware.
- At the end, it got into a loop saying "I apologize for the confusion. Let me double-check your records...". After staying silent, it said "it looks like we've been disconnected". I said "no, I was waiting for you to check my records". The loop repeated. I eventually asked how long it would take to check my records and it told me "a few minutes" but still went through the "disconnected" message.
Thanks for the great feedback! Absolutely, with a fine-tuned LLM or a better prompt, we can make the responses more reasonable. We'll make a note to update our demo prompt accordingly!
I had a similar experience when it offered to connect me to the office manager so I could remove PII from the system, where it just stated it was taking an action and went idle.
Wonder what the justifications for the different voice prices are...
The different providers have different prices. OpenAI TTS and Deepgram are cheaper; ElevenLabs is more expensive.
You'll be able to build your own high quality, low latency voices at scale.
With a little more tweaking and training, the voice AI will sound like the one in Her: https://en.wikipedia.org/wiki/Her_(film)
yes… one of our favorite movies
Very cool! Have you looked into call center agents use cases?
Yes, we are a developer tool, and we certainly get interest from clients working on customer satisfaction agents, call center agent training, etc.
Is there societal value that this product is harming?
We are dedicated to preventing that from happening. Spam calling and identity theft are key areas we will build guardrails around. Feel free to let us know if you think of any other case.
One use case that I'd be interested in for this is training it to use my voice a la Descript. It's been really nice that our Head of Marketing can iterate on video voiceovers using my voice and all I had to do was read 30 seconds of copy.
Any plans for something like that? It'd be really interesting to have prospects chat with an AI version of me to answer some basic questions.
Voice cloning is definitely on our roadmap.
Congrats on launching. It feels very natural and the demo call was good.
Thanks for the support. Means a lot to us.
Nice work. You seem to have addressed some of the challenges that arise in teaching computers to speak.
This blog breaks it down well:
We are actively working on that. Thanks for the support.
This is super cool, congrats on the launch!
Glad you like it!
I'm always curious where people get training data for things like this.
If you are referring to the LLM used in the demo, it's simple GPT. If you are referring to audio data, there are some (not a lot of) public datasets, though be careful about dataset licenses. To get more data, you might build a studio to collect recordings from contracted voice actors, or purchase from other sources.
Commercially, I struggle to see how this fits into the most natural application, a contact center.
For example, take Amazon Connect pricing:
There is an Amazon Connect service usage charge, based on end-customer call duration. At $0.018 per minute * 7 minutes = $0.126
There is an inbound call per minute charge for German DID numbers. At $0.0040 per minute * 7 minutes = $0.0280
If I were to bring Retell into the loop, I'm changing my self-service per-minute cost from .018 + .004 = .022 per minute to .122 per minute at the cheapest setting. And from that, I don't have a clear case for the impact on deflection, and because of that I don't have clear ROI.
This isn't saying that the product isn't great -- but at the current price I struggle to see how anyone at scale can use it without eating the double whammy of LLM + Retell costs.
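Spelling that out (using the $0.10/min Retell figure mentioned elsewhere in the thread as the cheapest setting):

```python
connect, did = 0.018, 0.004
baseline = connect + did            # $0.022/min self-service today
with_retell = baseline + 0.10       # $0.122/min with Retell's cheapest tier
print(f"${baseline:.3f}/min -> ${with_retell:.3f}/min "
      f"({with_retell / baseline:.1f}x), before LLM costs")
```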
Thanks for the heads-up. We're definitely aiming to cover customer service down the line. We'll be exploring ways to introduce a more affordable option soon.
Could you add Partykit/Cloudflare Durable Objects integration docs?
We will keep that in mind. In the meantime, if you have a working version, we are happy to feature your integration repo in our doc.
If I record screen and audio when playing video games with my friends, will I be able to fine-tune an LLM+audio model on that dataset?
It'll be like that San Junipero episode of Black Mirror - immortality in a dark way.
You certainly could, given you play video games for long enough to gather the needed data lol.
First thought is can’t you set up calls with prospects and have the product _literally_ sell itself?
Great idea. However, it seems like outbound sales calls could be annoying, and the FCC appears to restrict them.
In the "Select Voice" dialog, all the DeepGram clips end in a loud click. Might want to fix that.
Roger that! Will fix it.
I tried the demo, and it got confused and disconnected, but it's a cool proof of concept. Suggest bumping up the happiness emotion on the agent, and a Calendly integration would immediately unlock a lot of use cases. Good luck!
Thanks for the suggestion! Will take a look into the confusion problem
The demo was incredible, and this seems perfect for my current project. I am going to try to integrate this as soon as I can to see if it works for me. How responsive can I expect the support to be?
We pride ourselves on being very responsive! We usually create a Slack group with users actively integrating and answer any questions ASAP
My friend group and I have been playing with LLMs: https://news.ycombinator.com/item?id=39208451. We tend to hang out in multi-user voice chat sometimes, and I've speculated that it would be interesting to hook an LLM to ASR and TTS and bring it into our voice chat. Yeah, naive, and to be honest, I'm not even sure where to start. Have you tried bringing your conversational LLM into a multi-person conversation?
It's a great idea. We have a use case where a customer wants to add a voice agent into Zoom. We could schedule a call to talk about the tech design.
Curious what model the dentist bot is running on? Tried it out, was surprisingly good, though eventually it contradicted itself (booked a slot it said previously was not available). (I get that’s the programming but am curious especially given the latency is really great).
The demo uses simple GPT-3.5 Turbo.
This is absolutely wild - I got chills when I thought about the fact I’m talking to a computer. Congratulations on flying straight over uncanny valley.
Thanks for the support, we still have a lot of work ahead of us to make it better!
I'm assuming you're targeting this mainly at enterprises and business use cases such as call centers, but are you planning to make this usable for personal use cases as well? For example, having a bot to bounce ideas off while coding. Pretty much "just" the TTS/STT layer to talk to my fine-tuned LLM in a natural manner while you handle interruptions and such.
I think the main issue right now for personal use would be cost (and I'm guessing STT / TTS are the most expensive parts..)
A rubber duck that talks back? Now AI has truly gone too far dammit!
Sad that people think working on crypto is a waste of time yet here we are making antiquated contact methodologies even harder to prune out.
Hey, think about the bright side of voice AI -- there are new opportunities like AI companions and AI assistants, not only the antiquated contact methodologies.
Wow, this is sweet! With a little better latency and less perfection, it'd be well over the uncanny valley (not that it wouldn't fool many people as-is). Are you planning to add more "human" elements like filler words or disfluencies? If anything it feels too perfect to be human. Awesome stuff!
P.S: I tried to fool the Dental Office demo trying to book on Sunday or outside of the slots it had indicated, and it did a better job than many humans would have :)
Yes, we do plan to make the responses more conversational by adding pauses, filler words, slight stuttering, etc. This is also a high priority for us to work on.
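Just to illustrate the flavor of it (a toy post-processor, not our actual method), something like:

```python
import random

FILLERS = ["um", "uh", "let's see", "hmm"]

def add_disfluencies(text: str, p: float = 0.2) -> str:
    # Sprinkle fillers between clauses before the text reaches TTS.
    out: list[str] = []
    for clause in text.split(", "):
        if random.random() < p:
            out.append(random.choice(FILLERS))
        out.append(clause)
    return ", ".join(out)

print(add_disfluencies("We have openings Tuesday morning, or Wednesday at 2pm."))
```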
How is this different from vocode.dev?
backchanneling phrases are annoying in human conversation
Please make a free community self-host option available, like Danswer from your own YC batch.
What languages are you supporting?
It's pretty amazing but can get a bit lost/into a bugged state.
- Did not work on Firefox for me (I'd start a conversation and nothing would happen; I would not hear the voice)
- On Chrome it would not let me change my microphone, so I had to open my MacBook
- Also on Firefox, when I first logged in, clicking the "try now" button would send me back to the landing page; I had to go and click on the playground
With that out of the way, it's really interesting. The challenge, I suppose, is how narrow or wide the API made by the developer is. A narrow case, like "booking a dentist appointment", might feel like a step down compared to an online form, and would most likely fall short of satisfying someone calling, because if they're calling they have some deeper need.
On the other hand, with a wide API, like if you gave access to pricing information for the dental practice, info about the doctors, a way to reach an actual human being, health advice and post-op advice, generating documents from past appointments, etc., you have a higher chance of hallucination, misclassification of what the user wants, etc.
I'm still not sure if the best approach is just to leave the AI to deal with all that, or to represent the user intent etc. with some sort of state machine (see the sketch below). Maybe case by case. But the more you add to that logic, the slower the machine gets.
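Something like this toy state machine is what I mean (names illustrative): the LLM handles the language, while the machine pins down where the booking flow actually is.

```python
from enum import Enum, auto

class BookingState(Enum):
    GREETING = auto()
    COLLECT_DATE = auto()
    COLLECT_TIME = auto()
    CONFIRM = auto()
    DONE = auto()

# Which intents may advance the conversation from each state.
TRANSITIONS = {
    BookingState.GREETING:     {"wants_appointment": BookingState.COLLECT_DATE},
    BookingState.COLLECT_DATE: {"gave_date": BookingState.COLLECT_TIME},
    BookingState.COLLECT_TIME: {"gave_time": BookingState.CONFIRM},
    BookingState.CONFIRM:      {"confirmed": BookingState.DONE,
                                "rejected": BookingState.COLLECT_DATE},
}

def step(state: BookingState, intent: str) -> BookingState:
    # An LLM or classifier maps the utterance to an intent; the machine
    # constrains what the agent may do next.
    return TRANSITIONS.get(state, {}).get(intent, state)

state = BookingState.GREETING
for intent in ["wants_appointment", "gave_date", "gave_time", "confirmed"]:
    state = step(state, intent)
print(state)  # BookingState.DONE
```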
Is it possible to add knowledge by training with a PDF? Thanks in advance; it's all magic.
Unfortunately, based on my experience, GPT-3.5 is simply not up to snuff for phone conversations in terms of style and reasoning. And GPT-4T latency is far too high for real-time conversation with a straightforward STT-TextGen-TTS pipeline.
This was impressive. Have you integrated this with something like Amazon Connect or Twilio? I'm ultimately looking at how I could bring it to my customers via what they already use for their call centers.
This really makes me wonder if we can go full circle at fast-food restaurants like McDonalds. Ordering at a counter (10 seconds) has been replaced in recent years with touching a giant screen (2+ minutes).
This demo really shines light on how you could "talk" to place your order in a comparatively fast amount of time, similar to speaking with a human. Yes it can't handle all the million corner cases, and it will get it wrong a bunch of the time, but at least you can verify your order on a screen before tapping your phone to pay.
Why is every comment here from an account with no other comments ?
Ugh. Sorry. Probably some of their users found out about this thread.
I'm going to move all of this to an offtopic stub and collapse it.
We tell founders to make sure this doesn't happen (see https://news.ycombinator.com/yli.html) but I probably need to make the message louder. Not everyone understands that the culture of HN doesn't work this way.
Do you hire?
Thanks for asking. We are not hiring at this stage.
cool
thank you!
Amazing; tried the dental front desk from the playground. The voice sounds very natural, and I could hardly tell it's AI-generated.
glad you like it :)
Cool!
Thank you!
really good
Very cool