Are there any open-source or paid apps/shareware/freeware that can:
- Transcribe word-by-word in real time as audio is recorded
- Work entirely locally
- Use relatively recent open-source local models?
I've been using Otter.ai for real-time meeting transcription. It lets me multitask and instantly catch up if I'm asked a question, by skimming the most recent few seconds' worth of the transcript. But it's far from perfect, its real-time service occasionally has significant transcription delays, and it requires internet connectivity.
Most of the Whisper-based apps out there, though, as well as (when I last checked) the whisper.cpp demo code, require an entire recording to be ingested at once. There are others that rely on e.g. Apple's dictation frameworks, which are a bit dated in capability at the moment.
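For context, a common workaround for batch-only models is to re-run them over short, overlapping windows of the incoming audio. Here's a rough sketch of just the windowing part. It's a generic illustration, not how whisper.cpp or any particular app does it; `sliding_windows`, `window`, and `hop` are names made up for this example, and the actual transcription call is left out.

```python
def sliding_windows(samples, window, hop):
    """Yield (start_index, chunk) pairs covering `samples`.

    Each chunk holds up to `window` samples and consecutive chunks start
    `hop` samples apart, so they overlap whenever hop < window.
    All names here are hypothetical, for illustration only.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    start = 0
    while start < len(samples):
        yield start, samples[start:start + window]
        # Stop once a chunk has reached the end of the buffer.
        if start + window >= len(samples):
            break
        start += hop


# Each chunk would be handed to a batch transcriber (not shown here).
for start, chunk in sliding_windows(list(range(10)), window=4, hop=2):
    print(start, chunk)
```

With 16 kHz audio, `window=80000` and `hop=16000` would mean re-transcribing the last 5 seconds once per second. The overlap is what lets you stitch the outputs together, but it also means paying for the same audio several times.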
Anything folks are using out there?
I have built my own local-first solution to transcribe entirely locally in real time, word by word, driven by a different need (I'm hard of hearing). It's my daily driver for transcribing meetings, interviews, etc. Because it's local-first, I don't have to worry about privacy when transcribing meetings at work, as all data stays on my machine. It's about as fast as Otter.ai, although there's definitely room for improvement in terms of UX and speed. The caveat is that it only works on MacBooks with Apple silicon. Happy to chat over email (see my HN profile).
I have some staff with combined hearing and visual needs. Have you researched the one-, two-, and all-party consent requirements? Asking because I hope to identify transcription as "non-recording".
California has an exception for hearing aids and other similar devices, but it’s unclear if transcription aids count, or if this has been tested in court. https://codes.findlaw.com/ca/penal-code/pen-sect-632/ (Not a lawyer, this is not legal advice.)
If it were ephemeral? Would that change this? Say, recording the meeting locally in a 5-minute frame, then updating a meeting summary?
Do you mean ephemeral, or are you actually wondering about something implanted under the skin? I'd think/hope if it goes under the skin, it ends up in "hearing aid" territory. I'm less sure about if it doesn't persist.
Yup, typo, sorry
Two/all-party consent laws are hacky workarounds for the actual harm being inflicted. Valid goals include not having your microwave inform Google's ad servers, not recording out-of-context jokes as evidence to imprison people, and so on. Things that shouldn't be blocked but get caught up in the collateral damage include topics like the current one about hearing issues. (Note that a sufficiently accurate transcription service has all the same privacy problems two-party consent tries to protect against, maybe more, since a transcript is more easily searchable.)
I'd be in favor of some startup pulling an Uber or Airbnb and blatantly violating those laws to the benefit of the deaf or elderly if it meant we could get something better on the books.
What did your own research turn up?
I was so excited until the very end. I have the wrong hardware.
Google Pixel phones have this feature and it works _very_ well.
How is that feature accessed? Or what does Google call it, so I can search for it?
Live Transcribe in the accessibility settings. AFAIK it's available on any fairly recent Android phone. I bought a Pixel tablet for no other reason but to run it -- nothing else I've tried comes close for local-only continuous transcribe-as-they-speak. (iOS has a similar feature also under accessibility; it's good but not at the same level. Of course I'd love to see an open-source solution.)
This was for English. One problem it took me a while to realize: when I switched it to transcribe a secondary language, it was not doing it on-device anymore. You can tell the difference by setting airplane mode.
There's a captioning button under the volume slider, and I think it's called "live captions" or something in settings. Just tap the button and it'll start.
https://support.google.com/accessibility/android/answer/9350...
Have you tried it with non-English languages?
New Microsoft Surface devices have this feature, but it only works for English.
I helped code oTranscribe+ [0], which does something similar to what you are asking for. It's a desktop application built with ElectronJS on what was, at the time, the current version of oTranscribe. It also exists as a web version and a PWA [1].
The language models were those from BSC (Barcelona Supercomputing Center) at the time. Transcription is done via WASM, using Vosk [2] as the base.
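For anyone curious what streaming recognition with Vosk looks like, here is a minimal sketch of the feed-a-chunk, read-a-partial loop. Note the assumptions: it uses Vosk's Python bindings rather than the WASM build the app actually runs on, the helper name `stream_transcribe` is made up, and the recognizer is passed in so the loop itself doesn't depend on a model being installed. `AcceptWaveform`, `Result`, `PartialResult`, and `FinalResult` are the methods on Vosk's `KaldiRecognizer`.

```python
import json


def stream_transcribe(recognizer, chunks):
    """Yield ("partial" | "final", text) events as audio chunks arrive.

    `recognizer` is expected to behave like vosk.KaldiRecognizer;
    `chunks` is any iterable of raw PCM byte strings.
    """
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):
            # The recognizer decided an utterance just ended.
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                yield ("final", text)
        else:
            # Still mid-utterance: report the running hypothesis.
            partial = json.loads(recognizer.PartialResult()).get("partial", "")
            if partial:
                yield ("partial", partial)
    # Flush whatever is left once the audio ends.
    tail = json.loads(recognizer.FinalResult()).get("text", "")
    if tail:
        yield ("final", tail)


# Assumed usage with the real library (model path is a placeholder):
#   from vosk import Model, KaldiRecognizer
#   rec = KaldiRecognizer(Model("path/to/model"), 16000)
#   for kind, text in stream_transcribe(rec, mic_chunks):
#       print(kind, text)
```

The partial events are what give the word-by-word feel; the final events replace them once the recognizer commits to an utterance.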
I hope it fits.
[0] https://github.com/projecte-aina/oTranscribe-plus
[1] https://otranscribe.bsc.es/
[2] https://github.com/alphacep/vosk-api
Is there a way to get it to punctuate? Or does it only jot down words?
I've been using Transcribro[0] on Android/GrapheneOS. It's FOSS and only local, and while it's not word-for-word real-time, it doesn't have to wait for the whole audio to be uploaded before it can work. This is on a Pixel 5a, so hardly impressive hardware.
It works well enough that I use it with Telegram to shove messages over to my Linux machine when I don't feel like typing them out, which is such an unsophisticated hack, but is getting the job done. I spent a couple hours trying to find a Linux-native alternative, or even get this running in Waydroid, and couldn't find anything that worked as well, so I decided not to let the "smooth" become the enemy of the "good enough to get the job done."
[0] https://github.com/soupslurpr/Transcribro
futo.org has a FOSS voice input Android app (voiceinput.futo.org) and live captions (https://github.com/abb128/LiveCaptions) for Linux. They specifically developed their own model that does fast real-time transcription.
Not sure if that helps for your specific use case.