I developed an RSI-related injury back in 94/95 and have been using speech recognition ever since. I would love a solution that lets me move off of Windows and easily dictate into text areas in Firefox, Thunderbird, or VS Code. Most important, however, would be the ability to edit/manipulate the text using what Nuance used to call Select-and-Say. The ability to do minor edits, replace sentences with new dictation, etc., is so powerful and makes speech much easier to use than the straight captured dictation of most Whisper apps. If you can do that, I will be a lifelong customer.
The next most important thing would be the ability to write action routines for grammars. My preference is for Python because it's the easiest target when using ChatGPT to write code. However, I could probably learn to live with other languages (except JavaScript, which I hate). I refer you to Joel Gould's "natPython" package, which he wrote for NaturallySpeaking. Here's the original presentation that people built on: https://slideplayer.com/slide/5924729/
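For anyone who hasn't seen natPython-style grammars, here's a rough sketch of the idea: a grammar rule plus a Python action routine that fires when the rule is recognized. The class and method names below follow natlink's conventions as I remember them, so treat them as approximate rather than the exact API:

    # Rough sketch of a natlink-style grammar with a Python action routine.
    # Names are from memory -- check Joel Gould's natlink docs for the real API.
    from natlinkutils import GrammarBase

    class MoveGrammar(GrammarBase):
        # Spoken forms: "move down three", "move up one", etc.
        gramSpec = """
            <move> exported = move (up | down) (one | two | three);
        """

        def initialize(self):
            self.load(self.gramSpec)   # compile the grammar spec
            self.activateAll()         # make it active in every window

        def gotResults_move(self, words, fullResults):
            # Action routine: called with the recognized words.
            direction = words[1]                      # "up" or "down"
            count = {"one": 1, "two": 2, "three": 3}[words[2]]
            print("moving %s %d line(s)" % (direction, count))

The point is that the action routine is ordinary Python, so anything you can script, you can say.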
Here's a lesson from the past. In the early days of DragonDictate/NaturallySpeaking, when the Bakers ran Dragon Systems, they regularly had employees drop into the local speech recognition user group meetings and talk to us about what worked for us and what failed. They knew that watching us Crips would give them more information about how to build a good speech recognition environment than almost any other user community. We found the corner cases before anybody else. They did some nice things, such as supporting a couple of speech recognition user group conferences with space and employee time.
It seems like Nuance has forgotten those lessons.
Anyway, I was planning on getting work done today, but your announcement shoots that in the head. :-)
[edit] Freaking impressive. It is clear that I should spend more time on this. I can see how my experience of NaturallySpeaking limited my view, and you have a much wider view of what the user interface could be.
For those who don't know what happened next, and why Dragon seemed to stagnate so much in the aughts, the story of how Goldman Sachs helped them sell to what was essentially the Belgian Enron (Lernout & Hauspie), months before it collapsed, was quite illuminating to me, and sad.
https://archive.ph/Zck6i
Goldman Sachs is such a wonderful model of what is possible via Capitalism. I think they are holding back on what they really could achieve with a little will.
Downvoters: sarcasm alert!
That's only the intro. Here's the conclusion: https://www.cornerstone.com/insights/cases/janet-baker-v-gol...
It’s crazy to me that they were advised by what were essentially boys right out of college, and that they had any faith it would work…
You should check out Cursorless… it may more directly target your use case.
I saw it was based on Talon, but unfortunately, Talon makes things overly complex and focuses the user on the wrong part of the process. The learning curve to get started, especially when writing your action routines, is much higher than it needs to be. See: https://vocola.net/. It's not perfect; it's clumsy, but you can start creating action routines within 5 to 10 minutes of reading the documentation. Once you exceed the capabilities of Vocola, you can develop extensions in Python based on what you've learned in Vocola. One could say that Talon is the second-system effect from The Mythical Man-Month in action.
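To give a flavor of how low the barrier is, a complete Vocola command file looks roughly like this (syntax from memory, so check vocola.net for the exact details):

    # _sample.vcl -- approximate Vocola 2 syntax
    Save That     = {Ctrl+s};          # say "save that" to save
    Kill Line     = {Home}{Shift+End}{Del};
    Go Down 1..20 = {Down_$1};         # "go down seven" presses Down 7 times

Each line is "spoken form = keystrokes", and that's essentially the whole mental model you need to get started.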
My use case is dictating text into various applications and correcting that text within the text area. If I have to, I can use the dictation box and then paste it into the target application.
When you talk about using speech recognition for creating code, I've been through enough brute-force solutions like Talon to know they are the wrong way because they always focus the user on the wrong thing. When creating code, you should be thinking about the data structure and the environment in which it operates. When you use speech-driven programming systems, you focus on what you have to say to get the syntax you need to make it compile correctly. As a result, you lose your connection to the problem you're trying to solve.
Whether you like it or not, ChatGPT is currently the best solution as long as you never edit the code directly.
Thank you! We love hearing stories like this.
We want to get Aqua into as many places as possible — and will go full tilt into that as soon as the core is extremely, extremely solid (this is our focus right now).
Great lessons from Dragon Dictation. Would love to learn more about the speech recognition user group meetings! Are those still running? Are you a part of any?
Unfortunately, no. I think they faded out almost 20 years ago. The main problem was that, without anyone able to create solutions, the speech recognition user group devolved into a bunch of crips complaining about how fewer and fewer applications worked with speech recognition. We knew what was wrong; we knew how to iterate to where NaturallySpeaking should be, but nobody was there to do it.
FWIW, I am fleeing Fusebase, formerly known as Nimbus, because they "pivoted" and messed up my note-taking environment. In the beginning, I went with Nimbus because it was the only note-taking environment that worked with Dragon. After the pivot, not so much. I'm giving Joplin a try. Aqua might work well as an extension to Joplin, especially if there were a WYSIMWYG (what you see is mostly what you get) front-end like Rich Markdown. I'd also look at Heynote.
On a somewhat unrelated note, I remember Nuance used to be quite litigious, using its deep patent collection to sue startups and competitors. I'm not sure if this is still the case now that they're owned by Microsoft, but you may want to look into that.
I remember being in a conversation back in 2002 or so, where some Smalltalkers were brainstorming over the idea of controlling the IDE and debugger with voice.
It just so happens that many of the interfaces one has to deal with are fairly low-bandwidth. (For example, many people spend most of their time stepping over, stepping into, or setting breakpoints in a debugger.) Code completion greatly cuts down the number of options to be navigated second to second. It seems like the time has arrived for an interactive, voice-operated AI pair-programmer agent, where the human takes the "strategic" role.
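As a toy illustration of how small that second-to-second vocabulary is, the whole thing could start as a dispatch table (everything here is hypothetical; a real system would hook a speech engine to an actual debugger backend):

    # Toy sketch: mapping a tiny spoken vocabulary onto pdb-style commands.
    DEBUG_COMMANDS = {
        "step over":  "next",      # pdb 'n'
        "step into":  "step",      # pdb 's'
        "continue":   "continue",  # pdb 'c'
        "break here": "break",     # breakpoint at the current line
    }

    def handle_utterance(utterance: str) -> str:
        """Translate a recognized phrase into a debugger command, if any."""
        return DEBUG_COMMANDS.get(utterance.strip().lower(), "")

    for phrase in ("Step Over", "step into", "refactor everything"):
        print(phrase, "->", handle_utterance(phrase) or "(not a debug command)")

A handful of phrases covers most of a debugging session, which is exactly why the bandwidth argument works.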
I always felt coding could be such a great fit for voice recognition, as you have a limited number of tokens in scope and know all the syntax in advance (so recognition accuracy should be pretty good). Never saw a solution that really capitalized on that, though.