HN comments for: OTranscribe: A free and open tool for transcribing audio interviews

btown

18 replies

3h8m

2024-08-09 15:23:04 UTC

Are there any open-source or paid apps/shareware/freeware that can:

- Transcribe word-by-word in real time as audio is recorded

- Work entirely locally

- Use relatively recent open-source local models?

I've been using otter.ai for real-time meeting transcriptions - letting me multitask and instantly catch up if I'm asked a question by skimming the most recent few seconds worth of the transcript - but it's far from perfect and occasionally their real-time service has significant transcription delays, not to mention it requires internet connectivity.

Most of the Whisper-based apps out there, though, as well as (when I last checked) the whisper.cpp demo code, require an entire recording to be ingested at once. There are others that rely on e.g. Apple's dictation frameworks, which is a bit dated in capability at the moment.

Anything folks are using out there?
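Most of the gap described above, between tools that ingest a whole recording and tools that transcribe while audio arrives, comes down to chunking. Here is a minimal sketch. It assumes 16 kHz mono samples arriving as lists of floats, and it uses a hypothetical `transcribe_chunk` callback to stand in for whatever local model you run. Audio is buffered into fixed windows with a small overlap, so each window can be transcribed while recording continues.

```python
from typing import Callable, Iterable, Iterator, List

SAMPLE_RATE = 16_000  # assumption: 16 kHz mono, the rate Whisper-family models expect


def windows(
    blocks: Iterable[List[float]],
    window_s: float = 5.0,
    overlap_s: float = 1.0,
    sample_rate: int = SAMPLE_RATE,
) -> Iterator[List[float]]:
    """Regroup incoming audio blocks into fixed-length, overlapping windows.

    A window is yielded as soon as enough samples have arrived, so
    transcription can start before the recording ends. The overlap means a
    word straddling a boundary appears whole in at least one window.
    """
    size = int(window_s * sample_rate)
    step = size - int(overlap_s * sample_rate)
    if not 0 < step <= size:
        raise ValueError("overlap must be shorter than the window")
    buf: List[float] = []
    emitted = 0  # samples at the front of buf already included in a window
    for block in blocks:
        buf.extend(block)
        while len(buf) >= size:
            yield buf[:size]
            del buf[:step]  # keep the overlapping tail for the next window
            emitted = size - step
    if len(buf) > emitted:
        yield buf  # flush trailing audio that no window has covered yet


def live_transcript(
    blocks: Iterable[List[float]],
    transcribe_chunk: Callable[[List[float]], str],  # hypothetical model hook
) -> Iterator[str]:
    """Yield partial text for each window as soon as it is available."""
    for w in windows(blocks):
        yield transcribe_chunk(w)
```

In a real app, `blocks` would come from a microphone callback, and the text duplicated in each overlap would still need de-duplicating. This sketch leaves that step out.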

uohzxela

8 replies

3h2m

2024-08-09 15:29:03 UTC

I have built my own local-first solution to transcribe entirely locally in real time word by word, driven by a different need (I'm hard of hearing). It's my daily driver for transcribing meetings, interviews, etc. Because of its local-first capability, I do not have to worry about privacy concerns when transcribing meetings at work as all data stays on my machine. It's about as fast as Otter.ai although there's definitely room for improvements in terms of UX and speed. Caveat is that it only works on MacBooks with Apple silicon. Happy to chat over email (see my HN profile).

WaitWaitWha

6 replies

2h34m

2024-08-09 15:56:59 UTC

I have some staff with combined hearing and visual needs. Have you researched one-party, two-party, and all-party consent requirements? I'm asking because I hope to have transcription classified as "non-recording".

btown

3 replies

2h28m

2024-08-09 16:03:16 UTC

California has an exception for hearing aids and other similar devices, but it’s unclear if transcription aids count, or if this has been tested in court. https://codes.findlaw.com/ca/penal-code/pen-sect-632/ (Not a lawyer, this is not legal advice.)

noah_buddy

2 replies

1h45m

2024-08-09 16:46:10 UTC

If it were ephemeral, would that change things? Say, recording the meeting locally in 5-minute frames, then updating a meeting summary?

smeej

1 replies

1h30m

2024-08-09 17:01:25 UTC

Do you mean ephemeral, or are you actually wondering about something implanted under the skin? I'd think/hope if it goes under the skin, it ends up in "hearing aid" territory. I'm less sure about if it doesn't persist.

noah_buddy

0 replies

1h1m

2024-08-09 17:29:56 UTC

Yup, typo, sorry

hansvm

0 replies

1h2m

2024-08-09 17:29:42 UTC

Two-/all-party consent laws are hacky workarounds for the actual harm being inflicted. Valid goals include not having your microwave inform Google's ad servers and not recording out-of-context jokes as evidence to imprison people. Invalid goals caught up in the collateral damage include topics like the current one about hearing issues. (Note that a sufficiently accurate transcription service has all the same privacy problems two-party consent tries to protect against, maybe more, since it's more easily searchable.)

I'd be in favor of some startup pulling an Uber or AirBnB and blatantly violating those laws to the benefit of the deaf or elderly if it meant we could get something better on the books.

CyberDildonics

0 replies

37m

2024-08-09 17:54:20 UTC

What did your own research turn up?

smeej

0 replies

1h31m

2024-08-09 17:00:13 UTC

I was so excited until the very end. I have the wrong hardware.

baby_souffle

4 replies

2h46m

2024-08-09 15:45:19 UTC

Are there any open-source or paid apps/shareware/freeware

Google Pixel phones have this feature and it works _very_ well.

ericjmorey

2 replies

1h51m

2024-08-09 16:40:51 UTC

How is that feature accessed? Or what does Google call it so I can search for it.

abecedarius

0 replies

59m

2024-08-09 17:32:02 UTC

Live Transcribe in the accessibility settings. AFAIK it's available on any fairly recent Android phone. I bought a Pixel tablet for no other reason but to run it -- nothing else I've tried comes close for local-only continuous transcribe-as-they-speak. (iOS has a similar feature also under accessibility; it's good but not at the same level. Of course I'd love to see an open-source solution.)

This was for English. One problem it took me a while to realize: when I switched it to transcribe a secondary language, it was not doing it on-device anymore. You can tell the difference by setting airplane mode.

Groxx

0 replies

1h43m

2024-08-09 16:48:33 UTC

There's a captioning button under the volume slider, and I think it's called "live captions" or something in settings. Just tap the button and it'll start.

https://support.google.com/accessibility/android/answer/9350...

neves

0 replies

2h15m

2024-08-09 16:16:40 UTC

Have you tried it with non-English languages?

New Microsoft Surface devices have this feature, but it only works for English.

alfredgg

1 replies

1h28m

2024-08-09 17:03:21 UTC

I helped code oTranscribe+ [0], which does something similar to what you are asking for. It is a desktop application built with ElectronJS on top of the version of oTranscribe that was current at the time. It also exists as a web version and a PWA [1].

The language models were those from BSC (Barcelona Supercomputing Center) at the time. The transcription is done via WASM, using Vosk [2] as its base.

I hope it fits.

[0] https://github.com/projecte-aina/oTranscribe-plus

[1] https://otranscribe.bsc.es/

[2] https://github.com/alphacep/vosk-api

smeej

0 replies

2024-08-09 18:26:49 UTC

Is there a way to get it to punctuate? Or does it only jot down words?

smeej

0 replies

1h32m

2024-08-09 16:59:30 UTC

I've been using Transcribro[0] on Android/GrapheneOS. It's FOSS and only local, and while it's not word-for-word real-time, it doesn't have to wait for the whole audio to be uploaded before it can work. This is on a Pixel 5a, so hardly impressive hardware.

It works well enough that I use it with Telegram to shove messages over to my Linux machine when I don't feel like typing them out, which is such an unsophisticated hack, but is getting the job done. I spent a couple hours trying to find a Linux-native alternative, or even get this running in Waydroid, and couldn't find anything that worked as well, so I decided not to let the "smooth" become the enemy of the "good enough to get the job done."

[0] https://github.com/soupslurpr/Transcribro

andrei-akopian

0 replies

2h22m

2024-08-09 16:09:07 UTC

FUTO (futo.org) has a FOSS voice-input Android app (voiceinput.futo.org) and live captions for Linux (https://github.com/abb128/LiveCaptions). They specifically developed their own model for fast real-time transcription.

Not sure if that helps for your specific use case.

cube2222

15 replies

10h2m

2024-08-09 08:29:54 UTC

I needed to do this this week (transcribe an interview with multiple speakers) and used https://github.com/MahmoudAshraf97/whisper-diarization

It worked excellently.

It generates two files. One contains a line per uninterrupted stretch of speech, prefixed with the speaker number. The other has timestamps, which I believe is meant to be used for subtitles.
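The "one line per uninterrupted speaker" output amounts to merging consecutive words that share a speaker label into turns. Here is a minimal sketch. It assumes the diarizer has already produced ordered `(speaker, word)` pairs. The `Speaker N:` prefix is only illustrative and is not necessarily whisper-diarization's exact format.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def speaker_lines(words: Iterable[Tuple[int, str]]) -> List[str]:
    """Collapse an ordered stream of (speaker_id, word) pairs into one line
    per uninterrupted turn, prefixed with the speaker number."""
    lines = []
    # groupby only merges *adjacent* items with the same key, which is exactly a turn
    for speaker, turn in groupby(words, key=lambda pair: pair[0]):
        text = " ".join(word for _, word in turn)
        lines.append(f"Speaker {speaker}: {text}")
    return lines
```

When a speaker returns later in the conversation, they get a new line. That matches how the comment describes the file.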

wanderingmind

4 replies

7h38m

2024-08-09 10:53:40 UTC

The problem with using OpenAI's Whisper is that it's too slow on CPU-only machines. Whisper.CPP is blazing fast by comparison, and I wish people would build better diarization on top of it.

stavros

2 replies

7h3m

2024-08-09 11:28:12 UTC

What's OpenAI Whisper vs whisper.cpp? Do you mean whisper-diarization uses the API?

Zambyte

1 replies

6h51m

2024-08-09 11:40:30 UTC

https://github.com/openai/whisper

https://github.com/ggerganov/whisper.cpp

They are two inference engines for running the whisper ASR model, each with their own API AFAIK.

stavros

0 replies

6h49m

2024-08-09 11:42:00 UTC

Ah I see, thanks. Hm, I would imagine that it's not hard to make something that works with both (the surface area of the API should be fairly small, I imagine), odd that projects use the former and not the latter.

aidenn0

0 replies

3h10m

2024-08-09 15:21:22 UTC

Another advantage of Whisper.CPP is that it can use cuBLAS to accelerate models too large for your GPU memory. I can run the medium and large models with cuBLAS on my 1050, but only the small one if I use pure GPU mode.

adipasquale

3 replies

9h47m

2024-08-09 08:44:15 UTC

I have had very good results using Spectropic [1], a hosted Whisper diarization API service. I found it cheap, and much easier and faster than setting up and using whisper-diarization on my M1. Audiogest [2] is a web service built on Spectropic; I have not used it yet.

Disclaimer: I am not affiliated in any way, just a happy customer! After filing bug reports, I had some nice email exchanges with the (I believe solo) developer behind these tools.

---

[1] https://spectropic.ai/

[2] https://audiogest.app/

thomasmol

2 replies

7h23m

2024-08-09 11:08:29 UTC

Thanks for the shout-out and kind words!

Thomas here, maker of Spectropic and Audiogest. I am indeed focused on building a simple and reliable Whisper + diarization API. I'm also working on providing fine-tuned versions of Whisper for non-English languages through the API.

Feel free to reach out to me if anyone is interested in this!

dchuk

1 replies

4h1m

2024-08-09 14:30:36 UTC

Great-looking API. Do you support, or plan to add, automatic speaker identification based on labeled samples of people's voices? It would be great to have a library of known speakers that are automatically matched when transcribing.

thomasmol

0 replies

2h34m

2024-08-09 15:57:19 UTC

Thanks! That is something I might offer in the future and is definitely possible with a library like pyannote. Would be really cool to add for sure.

I am also experimenting with post-processing transcripts with LLMs to infer speaker names from a transcript. It already works pretty well, but it's still a bit expensive. This feature is available under the 'enhanced' model if you want to check it out: https://docs.spectropic.ai/models/transcribe/enhanced
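The "library of known speakers" idea raised above usually means comparing a voice embedding for each diarized speaker against embeddings of labeled samples. Here is a minimal sketch. It assumes the embeddings already exist as plain vectors. A library such as pyannote can produce them, but that step is not shown. The 0.7 similarity threshold is an arbitrary assumption that would need tuning.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(
    embedding: Vector,
    known: Dict[str, Vector],  # name -> embedding of a labeled voice sample
    threshold: float = 0.7,  # assumption: needs tuning on real recordings
) -> Optional[str]:
    """Return the closest enrolled speaker, or None if no score beats the threshold."""
    best_name: Optional[str] = None
    best_score = threshold
    for name, ref in known.items():
        score = cosine(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Speakers that come back as `None` could keep the diarizer's anonymous labels. Raising the threshold trades missed matches for fewer wrong ones.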

RamblingCTO

3 replies

9h57m

2024-08-09 08:34:38 UTC

I had better success with WhisperX, as whisper-diarization sometimes has weird issues I couldn't resolve: https://github.com/m-bain/whisperX

cube2222

2 replies

8h39m

2024-08-09 09:51:57 UTC

iirc whisper-diarization uses whisperx under the hood.

I’ll be honest, I haven’t dived much into this as I just needed something transcribed quickly, but when I was looking at WhisperX I couldn’t find a CLI that would just out of the box give me a text file with a line per speaker statement (not per word).

stavros

0 replies

6h56m

2024-08-09 11:35:08 UTC

iirc whisper-diarization uses whisperx under the hood.

It seems like it does:

https://github.com/MahmoudAshraf97/whisper-diarization/blob/...

RamblingCTO

0 replies

7h24m

2024-08-09 11:07:08 UTC

I use it like this:

whisperx "$file" --compute_type int8 --min_speakers 3 --max_speakers 3 --language de --hf_token "$token" --diarize

hubraumhugo

0 replies

3h36m

2024-08-09 14:55:09 UTC

Fascinating how traditionally very complex and hard ML problems are slowly becoming commodities with AI:

- transcription

- machine translation

- OCR

- image recognition

H8crilA

0 replies

9h32m

2024-08-09 08:59:28 UTC

I often subtitle old, obscure, foreign language movies with Whisper. Or random clips found on foreign Telegram/Twitter channels. Paired up with some GPT for translation it works great!

You can do this locally if you have enough (V)RAM, but I prefer the OpenAI API, as I usually don't have enough at hand. And the various Llamas aren't really on par with GPT-4 in quality. If you only need Whisper and no translation, then local execution is indeed very viable: high-quality Whisper fits in 4GB of (V)RAM.

jagermo

5 replies

10h22m

2024-08-09 08:09:39 UTC

Fantastic tool. I used it a lot to transcribe interviews during flights with no internet, when I needed to fill the time. Really useful to have if you do a lot of interviews.

dotancohen

4 replies

8h28m

2024-08-09 10:03:06 UTC

From the homepage:

A free web app to take the pain out of transcribing recorded interviews

How did you use a web app on the plane with no internet?

tampueroc

0 replies

8h12m

2024-08-09 10:19:28 UTC

The web app saves an offline copy for use the first time you open it. https://otranscribe.com/help/#can_i_use_otranscribe_offline

jagermo

0 replies

7h13m

2024-08-09 11:18:04 UTC

it works offline if you preload the website :)

grandfunction

0 replies

8h21m

2024-08-09 10:10:26 UTC

Ran the server on his or her laptop...

You don't need the internet to use a web browser

Havoc

0 replies

8h21m

2024-08-09 10:10:06 UTC

It’s MIT licensed so presumably self hosted

kimoz

3 replies

7h51m

2024-08-09 10:40:22 UTC

Does anyone know a free tool for generating subtitles for movies and TV series?

drtgh

0 replies

3h25m

2024-08-09 15:06:53 UTC

SubtitleEdit is one of the most complete and has many online tutorials from users.

Make sure they are recent tutorials, because those will probably cover the automated generation tools/plugins that weren't available years ago.

https://github.com/SubtitleEdit/subtitleedit

doug_life

0 replies

7h18m

2024-08-09 11:13:15 UTC

https://github.com/McCloudS/subgen worked very well for me. I had a TV series where, somehow, the timestamps for the last few seasons did not match up with the subtitle files I could find online. I used subgen and it worked surprisingly well.

BrunoJo

0 replies

6h22m

2024-08-09 12:09:16 UTC

You can try https://www.transcripo.com/ for free

TrojanHookworm

3 replies

10h10m

2024-08-09 08:21:19 UTC

I use this a lot. It's nice and simple and has exactly the tools you need (playback speed control, easy pause/play) and nothing more. I greatly prefer it over automatic transcription tools that give you 40 pages of 'umm's and 'ahhhh's to filter through and edit.

stavros

2 replies

6h49m

2024-08-09 11:42:52 UTC

Can you not give the transcript to an LLM to remove the umms and ahhs?
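For the plain "umm"/"ahh" case, a deterministic pass may be enough before, or instead of, an LLM. The sketch below swaps in a simple regex technique rather than the LLM approach suggested above. The filler list is an assumption, and unlike an LLM, it will not handle false starts or repeated words.

```python
import re

# Assumed filler list: um/umm, uh, ah/ahh, er/erm, hm/hmm. Extend as needed.
FILLER = re.compile(r"[ \t]*\b(?:u+m+|u+h+|a+h+|e+r+m*|h+m+)\b,?", re.IGNORECASE)


def strip_fillers(text: str) -> str:
    """Drop standalone filler words (and one trailing comma) from a transcript."""
    cleaned = FILLER.sub("", text)
    # Collapse any doubled spaces, then trim the ends of the string.
    return re.sub(r" {2,}", " ", cleaned).strip()
```

The `\b` anchors keep words like "umbrella" or "error" intact. Only spaces and tabs are consumed, so line breaks between speaker turns survive.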

BiteCode_dev

1 replies

5h56m

2024-08-09 12:35:51 UTC

People not used to AI have blind spots that prevent them from seeing obvious use cases like this.

I'm always surprised at the amazed look of my friends when they see me concretely use the tool. They just didn't picture it until they saw it in action.

stavros

0 replies

5h50m

2024-08-09 12:40:57 UTC

It's not even people not used to AI, I developed a tool that uses AI to do something, and then kind of couldn't be bothered to fix some of the output manually. It only occurred to me days later that I can ask the AI to fix it.

leiferik

2 replies

3h21m

2024-08-09 15:10:29 UTC

You're always welcome to try my service TurboScribe https://turboscribe.ai/ if you need a transcript of an audio/video file. It's 100% free up to 3 files per day (30 minutes per file) and the paid plan is unlimited and transcribes files up to 10 hours long each. It also supports speaker recognition, common export formats (TXT, DOCX, PDF, SRT, CSV), as well as some AI tools for working with your transcript.

rsingel

1 replies

2h29m

2024-08-09 16:02:54 UTC

This looks great. Do you have an API or plan to release one?

leiferik

0 replies

2h18m

2024-08-09 16:13:34 UTC

Thanks! Nothing to announce on the API front right now, but appreciate you asking :)

jrochkind1

2 replies

5h59m

2024-08-09 12:32:37 UTC

Kinda surprised to not have AI integration.

You do still need to proof and QA even AI results, if you want a publication quality result, and do things like attribute who is speaking when (at least Whisper can't do that), and correct "unusual" last names and things. So I feel like people using AI still need good tools for the correcting/finishing/proofing too, that would be similar to the tools for non-assisted transcription.

MattieTK

1 replies

5h22m

2024-08-09 13:09:00 UTC

This was written a really long time ago by a former WSJ Graphics reporter (Elliot Bentley) who is now at Datawrapper.

It is now operated by Muckrock and hasn't seen changes made to it in a while.

That's why it doesn't have any of these integrations, the technology just didn't exist.

jrochkind1

0 replies

5h3m

2024-08-09 13:28:29 UTC

Aha, good to know! That's actually important context, that this is not a recent release, and doesn't necessarily have a lot of ongoing development.

choya-love

2 replies

9h23m

2024-08-09 09:08:02 UTC

Any new language support in the future? Fingers crossed for Japanese.

fabianmg

0 replies

8h25m

2024-08-09 10:06:43 UTC

Am I missing something? From what I checked, it supports every language, since you are the one transcribing by hand. This is just a UI for watching the video or listening to the audio while you type.

comradesmith

0 replies

6h1m

2024-08-09 12:30:12 UTC

https://tactiq.io is made for meetings, but also does uploaded transcripts and supports Japanese!

accidbuddy

2 replies

4h7m

2024-08-09 14:24:06 UTC

Does anyone know one that does transcription and translation in real time?

Nowadays, I use libretranslate/libretranslate and pluja/whishper to do this, but not at real time.

Bayko

1 replies

4h3m

2024-08-09 14:28:41 UTC

Ah, this brings back memories. When I was in college with limited money, I used to pirate movies, most of which didn't have subtitles, and I used to daydream about writing a VLC plug-in that would generate subtitles in real time. But I had better things to do, like play video games...

space_oddity

0 replies

1h38m

2024-08-09 16:53:04 UTC

Many of us have had those ambitious tech ideas...

BetterWhisper

2 replies

8h20m

2024-08-09 10:11:21 UTC

If you are looking for something automatic that also lets you interact with your transcripts ChatGPT-style, I would recommend https://www.videototextai.com/

Terretta

1 replies

5h12m

2024-08-09 13:19:46 UTC

That cookies box though... Dark pattern (accept lots + accept all, fake drag affordance, covering a quarter of the page) for cookies doesn't bode well for privacy protections around the transcripts.

BetterWhisper

0 replies

2h2m

2024-08-09 16:29:49 UTC

You can delete any transcription you make, and once it is deleted we do not keep any copy of the transcript :). The cookie banner is there to comply with EU law.

nullbar

1 replies

8h54m

2024-08-09 09:37:13 UTC

Maybe it isn't perfectly clear, but OTranscribe isn't an automatic speech-to-text tool, but instead, a UI for assisting in manual transcribing.

So no AI here, folks.

space_oddity

0 replies

1h42m

2024-08-09 16:49:28 UTC

Yep, it's designed to assist with manual transcription

neves

1 replies

2h23m

2024-08-09 16:08:44 UTC

Has anybody tested it with Brazilian Portuguese? It is a hard problem, since we have so many accents.

dmd

0 replies

1h37m

2024-08-09 16:54:08 UTC

I don't understand what the issue is. You don't know how to type the different diacritical marks? Or the textbox isn't accepting them? (Which seems like it would be a browser issue, not an issue with the site.)

bcherny

1 replies

5h6m

2024-08-09 13:25:21 UTC

Looks cool! Unclear from the docs, but does it support non-English languages? How about mixed-language interviews?

avodonosov

0 replies

4h45m

2024-08-09 13:46:15 UTC

Yes! Any language you understand is supported!

ulrischa

0 replies

2h58m

2024-08-09 15:33:00 UTC

Pretty amazing what a web app can do. I wish there were more like it, rather than all these native apps.

tkgally

0 replies

4h47m

2024-08-09 13:44:19 UTC

I was curious how good a transcription I could get from what may be the best multimodal LLM currently, Gemini-1.5-Pro-Experiment-0801, so I had it transcribe five minutes of an interview between Ezra Klein and Nancy Pelosi from earlier today. The results are here:

https://www.gally.net/temp/20240809geminitranscription/index...

Aside from some minor punctuation and capitalization issues, Gemini’s transcription looks nearly perfect to me. There were only one or two words that I think it misheard. If I had transcribed the audio myself, I would have made more mistakes than that.

One passage struck me in particular:

  And then he comes up with "weird," which becomes viral and the rest, and here he is.

How did Gemini know to put “weird” in quotation marks, to indicate—correctly—that the speaker was referring to Walz’s use of the word as a word? According to Politico, Walz first used the word in that context in the media on July 23.

https://www.politico.com/news/2024/07/26/trump-vance-weird-0...

teddyh

0 replies

7h42m

2024-08-09 10:49:50 UTC

See also TranscriberAG: <https://transag.sourceforge.net/>

space_oddity

0 replies

1h43m

2024-08-09 16:48:02 UTC

oTranscribe is a free option for transcription but in many cases it's just too simple

phoronixrly

0 replies

10h7m

2024-08-09 08:24:34 UTC

https://github.com/oTranscribe/oTranscribe

matejmecka

0 replies

2h57m

2024-08-09 15:34:46 UTC

Just pitching in a transcription tool that lets you transcribe video and audio files using Whisper and WASM in your browser, and get a .txt, .srt, .vtt file. Maybe in the future support for Whisper Turbo?

https://video2srt.ccextractor.org/

Disclaimer: Working on this project.
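The .srt format that these tools export is simple enough to write by hand from timestamped segments. Here is a minimal sketch. It assumes segments come as `(start_seconds, end_seconds, text)` tuples, like the per-segment start/end/text that Whisper-style transcribers report.

```python
from typing import Iterable, Tuple


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render numbered SRT cues separated by blank lines."""
    cues = []
    for i, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text.strip()}\n")
    return "\n".join(cues)
```

WebVTT (.vtt) is nearly identical. It adds a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.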

kgdiem

0 replies

4h16m

2024-08-09 14:15:03 UTC

I started making an open source macOS app to do this with whisper and potentially pyannote.

It is functional but a bit slow. I think using whisper directly instead of swift bindings will help a lot.

I'm really interested in adding diarisation, but I'm having a lot of trouble converting Pyannote to CoreML. Pyannote runs very slowly with torch on CPU. I haven't gotten around to putting my latest work on that on GitHub yet.

Happy to accept contributions!

Some priorities right now:

* Fixing signing for local builds

* Replace swift whisper with whisper cpp

* Allowing users to provide their own models

https://github.com/Stack-Studio-Digital-Collective/Auditif

justinclift

0 replies

3h15m

2024-08-09 15:16:13 UTC

From their FAQ:

    Does oTranscribe automatically convert audio into text?
    
    Sorry! It doesn’t. oTranscribe makes the manual task of transcribing
    audio a lot less painful. But you still have to do the transcription.

ilt

0 replies

9h6m

2024-08-09 09:25:09 UTC

I currently use Aiko’s free iOS app which does offline transcription using OpenAI’s Whisper model. It has been working pretty well for me so far. It can export in formats like SRT, TXT, CSV, JSON and text with timestamps too. https://sindresorhus.com/aiko

dmitrykan

0 replies

6h35m

2024-08-09 11:56:43 UTC

I'm working on a tool that includes AI. My original goal is to test it on my channel, https://www.youtube.com/c/VectorPodcast, by offering something like what Lex Fridman does for his episodes.

Current features:

1. Download from YT
2. Transcribe using Vosk (output has time codes included)
3. Speaker diarization using pyannote (this isn't perfect and needs a bit more ironing out)

What needs to be done:

4. Store the transcription in a search engine (can include vectors)
5. Implement a web app

If anyone here is interested in joining forces, let me know.
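The Vosk transcription and pyannote diarization steps in that list meet at the point where each transcribed word gets a speaker. Here is a minimal sketch. It makes two assumptions. First, word timings are dicts shaped like Vosk's per-word results (`word`, `start`, `end` in seconds, available when word output is enabled). Second, diarization turns are `(start, end, speaker)` tuples, for example converted from pyannote output. Each word is assigned to the turn it overlaps most.

```python
from typing import Dict, List, Optional, Tuple

Turn = Tuple[float, float, str]  # (start_s, end_s, speaker_label)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words: List[Dict], turns: List[Turn]) -> List[Tuple[Optional[str], str]]:
    """Attach to each word the speaker whose turn overlaps it the most.

    Words that overlap no turn at all get None.
    """
    labeled = []
    for w in words:
        best, best_overlap = None, 0.0
        for start, end, speaker in turns:
            o = overlap(w["start"], w["end"], start, end)
            if o > best_overlap:
                best, best_overlap = speaker, o
        labeled.append((best, w["word"]))
    return labeled
```

The nested loop is O(words × turns), which is fine for a podcast episode. For very long recordings, both lists are sorted by time, so a single merge-style pass would also work.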

ciaran00

0 replies

5h51m

2024-08-09 12:40:53 UTC

Talio.ai lets you do this, with ChatGPT-style chat with the transcript plus numerous other features: https://talio.ai

bilater

0 replies

39m

2024-08-09 17:52:39 UTC

If you just want quick transcriptions of YouTube videos, this works pretty well: https://www.you-tldr.com/

avodonosov

0 replies

4h52m

2024-08-09 13:39:38 UTC

I made a similar tool for making tables of contents for youtube videos: https://youtoc.by/

I'm no longer actively developing it; I built it years ago to create tables of contents for the several videos I needed. If I ever need it again, I will probably work on the mobile UI (i.e., make it responsive).