I developed an RSI-related injury back in 94/95 and have been using speech recognition ever since. I would love a solution that lets me move off of Windows and easily dictate into text areas in Firefox, Thunderbird, or VS Code. Most important, however, would be the ability to edit/manipulate the text using what Nuance used to call Select-and-Say. The ability to do minor edits, replace sentences with new dictation, etc., is so powerful and makes speech much easier to use than the straight captured dictation of most Whisper apps. If you can do that, I will be a lifelong customer.
The next most important thing would be the ability to write action routines for grammars. My preference is for Python because it's the easiest target when using ChatGPT to write code. However, I could probably learn to live with other languages (except JavaScript, which I hate). I refer you to Joel Gould's "natPython" package, which he wrote for NaturallySpeaking. Here's the original presentation that people built on: https://slideplayer.com/slide/5924729/
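For anyone who hasn't seen natPython-style grammars, here's a rough sketch of the idea: a grammar rule plus a Python action routine that fires when the rule is recognized. The class and method names below follow natlink's conventions as I remember them, so treat them as approximate rather than the exact API:

    # Rough sketch of a natlink-style grammar with a Python action routine.
    # Names are from memory -- check Joel Gould's natlink docs for the real API.
    from natlinkutils import GrammarBase

    class MoveGrammar(GrammarBase):
        # Spoken forms: "move down three", "move up one", etc.
        gramSpec = """
            <move> exported = move (up | down) (one | two | three);
        """

        def initialize(self):
            self.load(self.gramSpec)   # compile the grammar spec
            self.activateAll()         # make it active in every window

        def gotResults_move(self, words, fullResults):
            # Action routine: called with the recognized words.
            direction = words[1]                      # "up" or "down"
            count = {"one": 1, "two": 2, "three": 3}[words[2]]
            print("moving %s %d line(s)" % (direction, count))

The point is that the action routine is ordinary Python, so anything you can script, you can say.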
Here's a lesson from the past. In the early days of DragonDictate/NaturallySpeaking, when the Bakers ran Dragon Systems, they regularly had employees drop into the local speech recognition user group meetings and talk to us about what worked for us and what failed. They knew that watching us Crips would give them more information about how to build a good speech recognition environment than almost any other user community. We found the corner cases before anybody else. They did some nice things, such as supporting a couple of speech recognition user group conferences with space and employee time.
It seems like Nuance has forgotten those lessons.
Anyway, I was planning on getting work done today, but your announcement shoots that in the head. :-)
[edit] Freaking impressive. It is clear that I should spend more time on this. I can see how my experience of NaturallySpeaking limited my view, and you have a much wider view of what the user interface could be.
For those who don't know what happened next, and why Dragon seemed to stagnate so much in the aughts, the story of how Goldman Sachs helped them sell to what was essentially the Belgian Enron (Lernout & Hauspie), months before it collapsed, was quite illuminating to me, and sad.
https://archive.ph/Zck6i
Goldman Sachs is such a wonderful model of what is possible via Capitalism. I think they are holding back on what they really could achieve with a little will.
Downvoters: sarcasm alert!
That's only the intro. Here's the conclusion: https://www.cornerstone.com/insights/cases/janet-baker-v-gol...
It’s crazy to me that they were advised by what were essentially boys right out of college, and that they had any faith it would work…
You should check out Cursorless… it may more directly target your use case.
I saw it was based on Talon, but unfortunately, Talon makes things overly complex and focuses the user on the wrong part of the process. The learning curve to get started, especially when writing your action routines, is much higher than it needs to be. See: https://vocola.net/. It's not perfect; it's clumsy, but you can start creating action routines within 5 to 10 minutes of reading the documentation. Once you exceed the capabilities of Vocola, you can develop extensions in Python based on what you've learned in Vocola. One could say that Talon is the second-system effect from The Mythical Man-Month in action.
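To give a flavor of how low the barrier is, a complete Vocola command file looks roughly like this (syntax from memory, so check vocola.net for the exact details):

    # _sample.vcl -- approximate Vocola 2 syntax
    Save That     = {Ctrl+s};          # say "save that" to save
    Kill Line     = {Home}{Shift+End}{Del};
    Go Down 1..20 = {Down_$1};         # "go down seven" presses Down 7 times

Each line is "spoken form = keystrokes", and that's essentially the whole mental model you need to get started.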
My use case is dictating text into various applications and correcting that text within the text area. If I have to, I can use the dictation box and then paste it into the target application.
When you talk about using speech recognition for creating code, I've been through enough brute-force solutions like Talon to know they are the wrong way because they always focus the user on the wrong thing. When creating code, you should be thinking about the data structure and the environment in which it operates. When you use speech-driven programming systems, you focus on what you have to say to get the syntax you need to make it compile correctly. As a result, you lose your connection to the problem you're trying to solve.
Whether you like it or not, ChatGPT is currently the best solution as long as you never edit the code directly.
Thank you! We love hearing stories like this.
We want to get Aqua into as many places as possible — and will go full tilt into that as soon as the core is extremely, extremely solid (this is our focus right now).
Great lessons from Dragon Dictation. Would love to learn more about the speech recognition user group meetings! Are those still running? Are you a part of any?
Unfortunately, no. I think they faded out almost 20 years ago. The main problem was that, without anyone able to create solutions, the speech recognition user group devolved into a bunch of crips complaining about how fewer and fewer applications worked with speech recognition. We knew what was wrong; we knew how to iterate to where NaturallySpeaking should be, but nobody was there to do it.
FWIW, I am fleeing Fusebase, formerly known as Nimbus, because they "pivoted" and messed up my note-taking environment. In the beginning, I went with Nimbus because it was the only note-taking environment that worked with Dragon. After the pivot, not so much. I'm giving Joplin a try. Aqua might work well as an extension to Joplin, especially if there were a WYSIMWYG (what you see is mostly what you get) front-end like Rich Markdown. I'd also look at Heynote.
On a somewhat unrelated note, I remember Nuance used to be quite litigious, using its deep patent collection to sue startups and competitors. I'm not sure if this is still the case now that they're owned by Microsoft, but you may want to look into that.
I remember being in a conversation back in 2002 or so, where some Smalltalkers were brainstorming over the idea of controlling the IDE and debugger with voice.
It just so happens that many of the interfaces one has to deal with are fairly low-bandwidth. (For example, many people spend most of their time stepping over, stepping into, or setting breakpoints in a debugger.) Code completion greatly cuts down the number of options to be navigated second to second. It seems like the time has arrived for an interactive, voice-operated AI pair-programmer agent, where the human takes the "strategic" role.
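As a toy illustration of how small that second-to-second vocabulary is, the whole thing could start as a dispatch table (everything here is hypothetical; a real system would hook a speech engine to an actual debugger backend):

    # Toy sketch: mapping a tiny spoken vocabulary onto pdb-style commands.
    DEBUG_COMMANDS = {
        "step over":  "next",      # pdb 'n'
        "step into":  "step",      # pdb 's'
        "continue":   "continue",  # pdb 'c'
        "break here": "break",     # breakpoint at the current line
    }

    def handle_utterance(utterance: str) -> str:
        """Translate a recognized phrase into a debugger command, if any."""
        return DEBUG_COMMANDS.get(utterance.strip().lower(), "")

    for phrase in ("Step Over", "step into", "refactor everything"):
        print(phrase, "->", handle_utterance(phrase) or "(not a debug command)")

A handful of phrases covers most of a debugging session, which is exactly why the bandwidth argument works.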
I always felt coding could be such a great fit for voice recognition, as you have a limited number of tokens in scope and know all the syntax in advance (so recognition accuracy should be pretty good). Never saw a solution that really capitalized on that, though.