From your end, how hard does it look to make the chatbot converse naturally? I'm thinking specifically about interruptions: if it's talking too long, I'd like to be able to start talking and interrupt it, as in a normal conversation, and if I'm the one talking, it could quickly interject. Once you have extremely high speed, theoretically faster than real time, you can start doing that kind of thing, right?
One more thing would remain after that for fully natural conversation: making the AI context-aware the way a human is. Basically, give it eyes so it can see your face and read your body language, so it can tell when it's talking too long and needs to be briefer, the same way a person does.
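To make the interruption idea concrete, here is a rough sketch of one way it might work, not a description of how any existing app does it. It assumes the synthesized speech arrives in small chunks and that some separate voice-activity detector sets a flag when the user starts talking. The names `speak_interruptibly`, `play_chunk`, and `user_speaking` are all hypothetical.

```python
import threading
from typing import Callable, Iterable


def speak_interruptibly(
    chunks: Iterable[bytes],
    play_chunk: Callable[[bytes], None],
    user_speaking: threading.Event,
) -> int:
    """Play audio chunk by chunk, stopping as soon as the user starts talking.

    `play_chunk` is a hypothetical callback that plays one chunk of audio.
    `user_speaking` is assumed to be set by a separate voice-activity detector.
    Returns the number of chunks actually played.
    """
    played = 0
    for chunk in chunks:
        # Check the flag before every chunk so the bot goes quiet quickly.
        if user_speaking.is_set():
            break
        play_chunk(chunk)
        played += 1
    return played
```

The shorter the chunks, the faster the bot can stop, which is where generation faster than real time helps: there is slack to cut speech into small pieces without falling behind.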
I made a 100% local voice chatbot using StyleTTS2 and other open source pieces (Whisper and OpenHermes2-Mistral-7B). It responds so much faster than ChatGPT. You can have a real conversation with it instead of the stilted Siri-style interaction you have with other voice assistants. Fun to play with!
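For anyone curious how the pieces fit together, the overall shape of one conversational turn is roughly the loop below. This is a simplified sketch, not the actual code of the app: the three stage functions are hypothetical placeholders that would wrap Whisper, the LLM, and StyleTTS2 respectively, and `make_turn_handler` is a made-up name.

```python
from typing import Callable, List, Tuple

# Hypothetical stage signatures; real models would be wrapped to fit them.
Transcribe = Callable[[bytes], str]                # speech-to-text (e.g. Whisper)
Generate = Callable[[List[Tuple[str, str]]], str]  # chat history -> reply (the LLM)
Synthesize = Callable[[str], bytes]                # text-to-speech (e.g. StyleTTS2)


def make_turn_handler(transcribe: Transcribe, generate: Generate, synthesize: Synthesize):
    """Return a function that handles one turn, plus the shared chat history."""
    history: List[Tuple[str, str]] = []

    def handle_turn(audio_in: bytes) -> bytes:
        user_text = transcribe(audio_in)       # 1. what did the user say?
        history.append(("user", user_text))
        reply = generate(history)              # 2. what should the bot say back?
        history.append(("assistant", reply))
        return synthesize(reply)               # 3. turn the reply into audio

    return handle_turn, history
```

With stub stages, e.g. `transcribe=lambda a: a.decode()`, `generate=lambda h: "echo: " + h[-1][1]`, and `synthesize=lambda t: t.encode()`, calling `handle_turn(b"hi")` returns `b"echo: hi"`. Since every stage runs locally, there is no network round trip anywhere in the loop.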
Anyone who has a Windows gaming PC with a 12 GB Nvidia GPU (tested on 3060 12GB) can install and converse with StyleTTS2 with one click, no fiddling with Python or CUDA needed: https://apps.microsoft.com/detail/9NC624PBFGB7
The demo is janky in various ways (requires headphones, runs as a console app, etc.), but it's a sneak peek at what will soon be possible on a normal gaming PC just by putting together open source pieces. The models are improving rapidly; there are already several better ones I haven't yet incorporated.