HN comments for: Audapolis: Edit audio files by transcript, not waveform

iainctduncan

20 replies

2d1h

2024-07-22 17:12:49 UTC

IMHO you should really change the headline on this. I'm an audio person, and my first thought was "that's stupid, words are awful at describing sound". But then I looked, and editing transcriptions of voice recordings by word is actually a great idea. That was not the impression the headline gave me, FWIW!

kajecounterhack

13 replies

2024-07-22 18:19:55 UTC

I'm also an audio person and I understood it just fine, so whatevs.

iainctduncan

12 replies

1d23h

2024-07-22 18:41:58 UTC

If you're trying to get attention, copy should be clear to all readers. The fact that you did not misread it in no way demonstrates that others won't.

And why the rude response?

kajecounterhack

10 replies

1d23h

2024-07-22 19:11:22 UTC

You're being insecure. It's not rude to disagree.

Also, there's often no perfect combo of words, there's a spectrum of options and you just pick an operating point. Transcription is a longer word than "word" so there's a tradeoff. It doesn't feel like a chasm to me.

dialsMavis

5 replies

1d22h

2024-07-22 20:15:18 UTC

I’m genuinely curious what you were trying to convey by completing your, totally valid, disagreement with “so whatevs”? I believe this is the part that’s perceived as rude because the expansion of that, “whatever”, is often further expanded as the sarcastic form of “whatever you say”.

kajecounterhack

4 replies

1d19h

2024-07-22 22:54:18 UTC

I was going for something between YMMV and "whatever you say." The slight tilt toward the latter was received poorly ¯\_(ツ)_/¯ It's just how I talk. Maybe it's generational.

DidYaWipe

3 replies

1d18h

2024-07-22 23:37:04 UTC

Nah. It's just being a condescending dick.

poincaredisk

2 replies

1d17h

2024-07-23 01:26:20 UTC

To be fair, I didn't read it as condescending. I understood it as a generic statement ender, like "idk" or "ymmv" or "i guess", i.e. something that you put at the end of a sentence when you don't know what more to say. Maybe it actually is generational?

kajecounterhack

0 replies

23h39m

2024-07-23 18:53:55 UTC

Thanks that was my understanding too, but I hurt the boomers with "whatevs" lol

If it makes y'all feel better my gen alpha kid will hurt my feelings in different ways and get revenge for you =_=

DidYaWipe

0 replies

1d14h

2024-07-23 04:22:06 UTC

If you really want to be that charitable.

iainctduncan

1 replies

1d23h

2024-07-22 19:22:29 UTC

Uh, no I'm not. In my work world, disagreeing with "whatevs" would be considered rude and dismissive and would be called out.

Believe me, I don't care that you disagree. I just don't like to see people breaking the civility guidelines here, as it's just about one of the last places online where discourse is largely held to a civil level for disagreements.

I write copy professionally, among other things. If you don't care whether what you write is clear to almost all readers... then I suppose it doesn't matter. Most people do not want misunderstandings of their copy and most copy editors would flag that as unclear. The new version is much better.

kajecounterhack

0 replies

1d19h

2024-07-22 22:56:47 UTC

"I just don't like to see people breaking the civility guidelines here as it's just about one of the last places online where discourse is largely held to a civil level for disagreements."

I seriously disagree that this breaks any sort of social contract between you and me on the internet. It was intended to be mildly dismissive but not overly rude. There's a higher standard for communicating with care at work (you should care about your coworkers), but do you really think people on the internet have time for this shit? I don't know you, guy.

bryanrasmussen

1 replies

1d9h

2024-07-23 09:30:43 UTC

In some contexts "whatever" can mean "it evens out", but in others it can mean "your opinion doesn't matter".

At any rate, when I first read it I thought it was going to be some sort of LLM thing where you said "remove the third bridge and increase pitch by one octave in the outro" and it would give you back an edited mp4 which you could then listen to, cringe at, and sometimes say "whoa, that's amazing!"

iainctduncan

0 replies

1d3h

2024-07-23 14:48:21 UTC

lol, great description of what I thought too. Someone's going to do it...

IshKebab

0 replies

1d23h

2024-07-22 18:53:54 UTC

I also understood it fine, but maybe we both just remember the Adobe demo that vunderba mentioned. I guess it might not be so obvious if you don't know about that?

On the other hand it does say "not waveform" which I think makes it pretty clear. What would you suggest instead?

dang

3 replies

1d23h

2024-07-22 18:59:14 UTC

What would be a clearer title?

TheRealPomax

2 replies

1d23h

2024-07-22 19:10:31 UTC

"[...] by transcript, not waveform".

iainctduncan

0 replies

1d23h

2024-07-22 19:22:44 UTC

WAY better!

dang

0 replies

1d23h

2024-07-22 19:16:58 UTC

Done. Thanks!

RockRobotRock

1 replies

1d23h

2024-07-22 18:51:51 UTC

To me (not an audio person), it was pretty obvious that the headline meant editing voice recordings.

iainctduncan

0 replies

1d23h

2024-07-22 19:20:21 UTC

It's not at all obvious. Given what we have seen recently, an equally plausible interpretation is "talk to an LLM and it will edit your audio" where audio could be anything.

It's not a good idea, but then tons of the LLM ideas we see here aren't either.

vunderba

11 replies

2024-07-22 18:13:06 UTC

I remember when Adobe demoed this idea of being able to edit waveforms by the recognized text back in 2016 and it was pretty mind blowing for the time.

https://youtu.be/I3l4XLZ59iw

EDIT: I could also definitely see Audapolis being useful if you could integrate it into a podcast's post processing flow (volume normalization, de-essing) by recognizing certain verbal tics and automatically removing them from the audio such as "ummmm...", etc.
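
For illustration, here is a minimal sketch of how verbal tics like "ummmm..." might be flagged in a word-level transcript before cutting them from the audio. The filler list, the regex, and the `filler_indices` helper are assumptions made up for this example; none of them come from Audapolis or any other tool mentioned in this thread.

```python
import re

# Hypothetical filler pattern: "um", "ummmm", "uh", "erm", "hmm" and stretched variants.
FILLERS = re.compile(r"^(um+|uh+|erm+|hmm+)$", re.IGNORECASE)


def filler_indices(tokens):
    """Return the set of positions whose token looks like a filler word.

    Surrounding punctuation is stripped first, so "ummmm..." and "Uh," both match.
    """
    return {
        i
        for i, token in enumerate(tokens)
        if FILLERS.match(token.strip(".,!?…"))
    }


tokens = ["So", "ummmm...", "this", "Uh,", "works"]
flagged = filler_indices(tokens)
cleaned = [t for i, t in enumerate(tokens) if i not in flagged]
print(cleaned)
```

A real pipeline would still have to map each flagged word back to its time range in the recording; that step is not shown here.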

Philip-J-Fry

9 replies

1d23h

2024-07-22 19:20:05 UTC

What ever happened to that Adobe demo? Was that a real product at any point? It's quite amazing how ahead of its time it was. Now that we have AI making people say whatever we want, it felt like Adobe was on the cusp of that then.

codetrotter

7 replies

1d22h

2024-07-22 20:18:54 UTC

I remember people saying at the time that “this is the point at which voice recordings can not be trusted any longer”. And then, like you said, nothing much happened for a few years until AI/ML tech got to where it is now.

jazzyjackson

6 replies

1d21h

2024-07-22 20:56:23 UTC

and there's still no commercial product for synthesizing video to sync lip movements to an edited transcript, like all the scary proofs of concept that turned the president into a puppet

Maybe there's not much value in editing what someone said after all

poincaredisk

3 replies

1d17h

2024-07-23 00:44:35 UTC

commercial value: no

criminal value... maybe?

There are absolutely scams right now that use deepfakes to trick people.

krisoft

1 replies

1d4h

2024-07-23 13:48:58 UTC

"commercial value: no"

Of course there is commercial value. The cost of reshooting video materials is huge. You made an advert mentioning 3 features, but by the time the product is about to be released one of them got dropped or even worse changed? Congrats, you need to get the talent and the studio rebooked, you need to find a new tech crew, who need to set up again. Probably things won't cut seamlessly so you need to re-record the whole thing.

scoot

0 replies

1d3h

2024-07-23 15:13:37 UTC

Potentially also for syncing lip movement for content dubbed in a foreign language. For me when watching foreign media dubbed to English the discrepancy is very noticeable and quite distracting, no matter how well the dub is written, performed, and edited to match the timing of the original.

hunter2_

0 replies

1d14h

2024-07-23 03:37:06 UTC

Yeah, it's a bit like asking why Microsoft doesn't make a BitTorrent client or why Chase doesn't offer a cryptocurrency. The prevailing use cases are just a bit too untoward, even if a few wholesome stories do exist.

omeze

0 replies

1d13h

2024-07-23 05:04:27 UTC

HeyGen allows you to do this in a few ways.

ipsum2

0 replies

1d11h

2024-07-23 07:27:09 UTC

This is used pretty often, you probably just don't notice it.

lofaszvanitt

0 replies

1d19h

2024-07-22 23:00:26 UTC

Were they strong-armed, or did they self-censor? Would be interesting to know the backstory.

suchire

0 replies

1d2h

2024-07-23 15:49:35 UTC

This workflow is exactly what Descript does. Transcript-based editing, filler word removal, noise reduction, volume normalization, Overdub spoken word correction using the speaker’s voice, eye gaze correction for video, etc.

Disclaimer: I work at Descript

emadda

4 replies

2024-07-22 18:05:56 UTC

Nice, are there plans to notarize the mac app?

I built something similar here: https://bigwav.app

j4nt4b

3 replies

22h32m

2024-07-23 20:00:29 UTC

I just tried this out and it's very nice and easy to use. Thank you for sharing! I ended up copy-pasting the output from the messages page, which is 99% of the way to exporting a .txt file and my personal use case. Great work.

emadda

2 replies

8h33m

2024-07-24 09:59:02 UTC

Thanks for the feedback! Maybe I should add a "download as .txt".

What do you typically do with the text on export? E.g. Do you parse the times?

j4nt4b

1 replies

8h27m

2024-07-24 10:05:30 UTC

In my videography work I often do a separate audio-only interview to use as voice-over for the final video. I like to print out a transcript, mark the highlights, then go to the sound file and extract the snippets I liked. Extracting the snippets is a lot easier when I have timestamps printed out inline with the text at intervals of one or two minutes. In the case of bigWav, there were timestamps marked at only three or four points, so I had to go back and manually enter ten more marks to orient myself on the page. In addition, I used ChatGPT on an answer-by-answer basis to clean up the copy and add in punctuation for ease of reading. So there was an hour or two of data sanitizing needed to get everything ready to print out and use efficiently.
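
As a rough sketch of the "timestamps inline at intervals of one or two minutes" idea: assuming the transcript is available as a list of (word, start time in seconds) pairs, a marker can be emitted before the first word that starts at or after each interval boundary. The input shape and both function names are hypothetical and are not how bigWav works.

```python
def format_ts(seconds):
    """Format a time in seconds as [MM:SS]."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def with_interval_marks(words, interval=60.0):
    """Insert a timestamp before the first word at or after each interval boundary.

    words: list of (text, start_seconds) pairs in spoken order (assumed format).
    """
    out = []
    next_mark = 0.0
    for text, start in words:
        if start >= next_mark:
            out.append(format_ts(start))
            # Skip every boundary already passed, so a long silence
            # does not produce a burst of consecutive markers.
            while next_mark <= start:
                next_mark += interval
        out.append(text)
    return " ".join(out)


words = [("So", 0.0), ("first", 30.0), ("then", 61.5), ("later", 200.0)]
print(with_interval_marks(words, interval=60.0))
```

Passing `interval=120.0` would give the two-minute spacing instead.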

emadda

0 replies

1h45m

2024-07-24 16:47:17 UTC

Two features that may help in bigwav:

1) You can add more timestamps by adding paragraphs with enter.

2) You can play back from any word by highlighting it and pressing space. You can also cut with right click on the wav UI.

alsetmusic

4 replies

2024-07-22 17:46:59 UTC

One of the hosts of a podcast that I listen to has had positive things to say about Descript.[0] Just mentioning it because he's been talking about it for a few years, so I expect it's had a good amount of feature development over time.

[0] descript.com/

mavsman

3 replies

2024-07-22 18:26:55 UTC

I love Descript. Their "convert to studio quality" feature is better than Adobe's and ElevenLabs', in my experience.

I wondered if this particular feature was really worth paying for so I was happy that I found Audapolis.

pimlottc

2 replies

1d23h

2024-07-22 19:15:48 UTC

What does that feature do?

omnimus

0 replies

1d11h

2024-07-23 07:07:45 UTC

It's a machine-learning-powered noise reduction + compressor + EQ + normalize combo effect. Works OK. Results in quite an overdone “studio” sound. I think the trend in mixing is leaning much more natural (less tweaked) nowadays. But for no work it might be impressive. Probably works well in internal corporate presentations.

mavsman

0 replies

1d7h

2024-07-23 11:06:39 UTC

What I particularly like about the Descript version (though it is overdone as mentioned) is that it reduces or eliminates the pesky S sounds and P sounds (called sibilances and plosives) that you get when you talk into a microphone and you're not perfectly distanced from it.

I haven't found another app that reduces or removes these.

jdprgm

3 replies

1d21h

2024-07-22 20:51:06 UTC

This really needs a video demo or at least a more in-depth text description of the features. I'll download it later to try, but I'm curious: does this just do simple hard cuts on the audio, or is there any AI magic for blending sentence timing, if that makes sense?

A number of comments turned me on to Descript. I made a similar comment on another audio thread recently: it drives me absolutely insane how all audio tools with any AI are web-based monthly SaaS instead of an offline, private, GPU-based upfront purchase.

aabhay

2 replies

1d20h

2024-07-22 21:37:33 UTC

The web based tools launch and move faster. There’s no lack of offline tools, if you’re the kind of person that files issue tickets in their spare time

DidYaWipe

1 replies

1d18h

2024-07-22 23:35:07 UTC

"if you’re the kind of person that files issue tickets in their spare time"

What does that have to do with non-Web-based applications?

DidYaWipe

0 replies

1d14h

2024-07-23 04:18:50 UTC

generalizations

3 replies

1d16h

2024-07-23 01:35:38 UTC

Combine this with the tech to generate new audio matching the speaker's voice profile, and you've really got something cool.

leumon

2 replies

1d8h

2024-07-23 09:51:52 UTC

It's difficult to do this for a video (and probably wouldn't look that nice)

phrotoma

0 replies

1d5h

2024-07-23 12:55:35 UTC

Descript does an okay job of this too.

generalizations

0 replies

1d3h

2024-07-23 15:20:34 UTC

This is for just audio, though?

StarterPro

3 replies

1d2h

2024-07-23 15:43:06 UTC

Call me a jerk, but anyone who is editing audio seriously, probably wants the waveform, no?

porkbeer

2 replies

1d2h

2024-07-23 16:20:22 UTC

Podcasters are much less picky, with much more audio to process. For music or film, I would strongly agree.

swyx

0 replies

2024-07-23 18:19:21 UTC

as podcaster, yup. chucking 2hrs of audio in descript and removing 700 ums is golden

pavel_lishin

0 replies

1d2h

2024-07-23 16:21:51 UTC

It's also probably Good Enough for a first pass-through.

I'm stuck in editing hell right now, and it would be very nice to just visually scroll past a few pages of pre-episode bullshitting and be able to wipe out whole minutes at a stretch, without having to listen to the whole thing. Even at increased speed, it's a bit of a slog.

jiehong

2 replies

2d1h

2024-07-22 17:13:11 UTC

That’s awesome!

Is 1 emoji for each commit title a new trend?

larrybolt

0 replies

2d1h

2024-07-22 17:30:39 UTC

I'm not sure how new the trend is, but it's called gitmoji (https://gitmoji.dev/) and there's also tooling to make committing/searching for the "correct" emoji easier :D Whatever makes your job more fun, right? Oh and it saves on characters!

DJiK

0 replies

2d1h

2024-07-22 17:32:46 UTC

Gitmoji has been around for eight years now. https://gitmoji.dev/

hammeiam

2 replies

2024-07-22 18:00:07 UTC

I've spent some of my free time over the past couple of months working on something similar. It's in a decent state but I need help from somebody who understands the .fcpxml format so you can export your edits to DaVinci and FCP.

Take a look at https://matcha.video

alok-g

1 replies

2024-07-23 18:18:03 UTC

Looks useful. Does it export as a video file itself (e.g., mp4)? Thanks.

hammeiam

0 replies

2h6m

2024-07-24 16:26:30 UTC

Right now it exports a .fcpxml file, which you would import into your editor (DaVinci, Final Cut, etc.) and which includes all of the cuts you made. From there you could move things around, add effects, do color grading, whatever you need to do to get to a final product.

Machado117

2 replies

1d8h

2024-07-23 09:56:06 UTC

The other day I was using the Voice Memos app on iOS 18 and was surprised to find that it also supports editing the recording by transcript.

mavsman

1 replies

1d6h

2024-07-23 12:00:28 UTC

I just upgraded to iOS 18 to try this and couldn't find it. How do you actually do it?

Machado117

0 replies

22h18m

2024-07-23 20:14:51 UTC

I think it generates the transcript automatically for new recordings, but you can also edit an old one and then generate the transcript from there.

MForster

2 replies

1d23h

2024-07-22 18:53:23 UTC

And here I was expecting that I could edit the text and the app would change the audio file to say what I had typed...

MikeTheGreat

1 replies

1d21h

2024-07-22 20:44:59 UTC

Can I ask what this tool does? I was trying to figure it out (the GitHub page isn't terribly clear) and came to the same conclusion you did (delete a chunk of the transcript and the tool would delete that audio).

I think I just lack experience in this area. I've used Audacity to cut out parts of audio / splice together two clips and that's about it, so I clearly don't have enough background to understand what this tool does.

Can someone clarify what this tool does, please? :)

imp0cat

0 replies

1d11h

2024-07-23 06:33:15 UTC

It does exactly what you think it does. You can cut parts of the original file without having to edit the waveform (like you would in Audacity). Instead, select the parts directly just like you would in a text editor.

What it does not do is generate new words (ie you type a sentence and it adds that to your file as voice).
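
To make the mechanism concrete, here is a minimal sketch of transcript-based cutting. It assumes the speech recognizer gives each word a start and end time; deleting words in the text then amounts to keeping only the time ranges of the remaining words. The `Word` type and `kept_segments` function are hypothetical names for this example and are not taken from the Audapolis source.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds into the recording
    end: float    # seconds into the recording


def kept_segments(words, deleted):
    """Return merged (start, end) ranges covering the words that were not deleted.

    words:   list of Word in spoken order
    deleted: set of indices the user removed in the text view
    """
    segments = []
    prev_kept = None
    for i, word in enumerate(words):
        if i in deleted:
            continue
        if segments and prev_kept == i - 1:
            # Neighbouring kept word: extend the current range, keeping the pause between them.
            segments[-1] = (segments[-1][0], word.end)
        else:
            # First kept word, or first word after a deletion: start a new range.
            segments.append((word.start, word.end))
        prev_kept = i
    return segments


words = [Word("hello", 0.0, 0.4), Word("um", 0.5, 0.9), Word("world", 1.0, 1.5)]
print(kept_segments(words, deleted={1}))  # [(0.0, 0.4), (1.0, 1.5)]
```

An audio backend would then concatenate those ranges from the original file; nothing is synthesized, which matches the point above that the tool does not generate new words.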

pryelluw

1 replies

1d23h

2024-07-22 18:44:19 UTC

If the maintainer is reading, having a demo video would be nice.

pragmatick

0 replies

1d8h

2024-07-23 09:45:27 UTC

https://news.ycombinator.com/item?id=41039955

frakkingcylons

1 replies

1d23h

2024-07-22 18:55:06 UTC

Somewhat off-topic: I saw the funding note at the bottom - it’s pretty cool that the German government is giving some funding to projects like this. I wonder how much the US is doing in that regard, like if there’s a list of projects that tax dollars go towards.

vinniep1

0 replies

1d20h

2024-07-22 22:27:14 UTC

You can find some answers to that here: https://www.nsf.gov/

corn13read2

1 replies

1d12h

2024-07-23 06:32:40 UTC

This is pretty dated and doesn't support Whisper, which is currently the de facto speech recognition model.

alok-g

0 replies

2024-07-23 18:11:53 UTC

Does it allow using Whisper separately and importing? Sounds like not.

raymond_goo

0 replies

1d21h

2024-07-22 21:21:14 UTC

Demo Video: https://pajowu.de/audapolis_intro.mp4

petarb

0 replies

2d1h

2024-07-22 17:11:38 UTC

This is awesome to see as an open source project.

This functionality is some of my favorite when editing videos in Descript. It’s so much easier than chopping up waveforms in Audacity

leetrout

0 replies

2024-07-22 18:27:51 UTC

Hindenburg also added this capability.

Hindenburg’s manuscript feature gives you a complete overview of your audio. You can select the text just as you would in a text document and watch as your edits are made in real-time. If you need to export your text in a specific format, no problem. Hindenburg supports the most common text and transcription export formats.

https://hindenburg.com/

j45

0 replies

1d4h

2024-07-23 14:18:46 UTC

This is exciting to see - it seems the last release was a year ago.

Can anyone clarify if this project is active?

geekodour

0 replies

2024-07-22 17:49:35 UTC

This looks great! Will try it out. I built a similar but very scrappy tool [0] for the same use case last year; I probably wouldn't have built it if I'd found this.

[0] https://github.com/geekodour/wscribe-editor

bluelightning2k

0 replies

1d22h

2024-07-22 19:40:05 UTC

A genuinely free alternative to Descript sounds very useful.

I've always liked the idea of Descript and was considering building something similar before it came out. The problem is that my use case is a couple of videos a year, so it doesn't fit with an expensive monthly subscription.