
Show HN: AI dub tool I made to watch foreign language videos with my 7-year-old

sorenjan
33 replies
1d18h

I know Germany dubs most videos, but wouldn't a seven-year-old be able to read subtitles? It's a great way for her to learn English; it's how most Swedes learn it before starting school. I think there's a pretty strong correlation between countries' average English proficiency and how common dubbing is.

https://haonowshaokao.com/2013/05/18/does-dubbing-tv-harm-la...

Edit: I forgot to mention that the samples on the website are impressive and well made. How do you do the speaker diarization and voice cloning?

wodenokoto
14 replies
1d13h

> but wouldn't a seven-year-old be able to read subtitles?

No, they wouldn't.

I don't believe that most Swedes learn English by reading subtitles before starting school.

> I think there's a pretty strong correlation between countries' average English proficiency and how common dubbing is.

That I agree with.

NicoJuicy
5 replies
1d12h

Most people in Belgium learn English through that before school.

Why wouldn't Swedes?

wodenokoto
2 replies
1d10h

You are saying that _most_ kids in Belgium can read subtitles before they start school?

It took me several years of school before being able to read fast enough to follow along with subtitles, and the same goes for everyone I know.

NicoJuicy
0 replies
1d6h

I think you're a native English speaker and have the associated bias concerning how it works in practice?

Kids first learn their native language (writing, reading and speaking) in school, and only years after that (mostly) learn foreign languages.

Once they've learned to do that in their native language, they hear English spoken on TV with e.g. Dutch subtitles and pick it up, sometimes before they have English lessons.

As such, most kids know a fair amount of English before they have it as a subject in school.

The Dutch subtitles aren't always a requirement though. Kids will pick it up from some shows; e.g. Pokémon would be a good example if it's spoken in English.

Freak_NL
0 replies
1d10h

They probably meant before they start learning English in primary school, not before they start school.

This used to be the case in the Netherlands too; I picked up a significant body of English from British TV series watched with subtitles as a kid. Nowadays this advantage will probably be missed by most children, because the streaming services offer a lot of dubbed content, and you get to pick what you watch unless someone guides you. Subtitles can be avoided for longer.

duckmysick
1 replies
1d10h

Are you saying that kids of age six can understand and speak English at a basic level - say halfway to A1?

Or is it just a basic familiarity (like a couple of most common words) and awareness that English exists?

EDIT: I see from a reply below by Freak_NL that it probably means before the kids start learning English at school. That makes more sense, as they would be older at that point.

kwhitefoot
0 replies
1d8h

> as they would be older at that point.

I don't know about The Netherlands but here in Norway children start learning English as soon as they start school at the age of five or six. But quite likely many of them will have at least some English already because of English language television, computer games, etc.

konschubert
2 replies
1d6h

My 6 year old has been watching 20 minutes of cartoons every night for the past two years. This is the only exposure to the English language that she has ever had.

She has learned to understand what is said in the cartoons. Of course she misses some things, but it's surprising how much she gets.

Like, when I ask her "what did Bluey just say?", she can explain it.

Children's brains are awesome.

But actually, grown-ups can also pick up quite a lot if they actually immerse themselves.

mysterydip
1 replies
1d6h

Bluey is an excellent cartoon to do that with. Kudos!

konschubert
0 replies
1d6h

I just wish there was a way to buy the Australian original version as a download.

anhner
1 replies
1d8h

> but wouldn't a seven year old be able to read subtitles?

No, they wouldn't.

hard disagree

voidpointer
0 replies
1d4h

Reading speed at that age will vary greatly. Reading subtitles while also having to follow the picture takes away focus, and that makes it much harder for an inexperienced reader. My daughter, who picked up reading very naturally, would have been able to follow subtitles at age 7 without much trouble. My younger, 7-yo son on the other hand, who is more average in reading ability, wouldn't be able to keep up with subtitles yet. Average reading speeds at age 7 seem to be 60-100 words per minute, whereas subtitles are more in the 100-150 words per minute range. So for above-average readers it will be possible, but average readers won't be able to keep up consistently.

vidarh
0 replies
1d6h

Subtitles in a foreign language? Probably not. Subtitles translated into their own language? I think it's probably an exaggeration that people have learnt it before starting school, because it implies a lot about what learning it means, but picking up a number of words, sure.

ivanhoe
0 replies
1d4h

Young kids don't even need subtitles, their brains are wired to figure out spoken languages; after all, that's how we all learn our mother tongue initially. Last summer my then 3.5-year-old, to my huge surprise, started talking in (simple, but correct) English with some tourist kids she met in the park. We had never spoken English at home with her before, so I presume she picked it up from YouTube and her older brother, but I had no idea she could form full sentences, including conditionals and past tense. At first she was a bit slow to express herself, but after a few hours of play with those kids she sounded totally relaxed and fluent.

input_sh
0 replies
1d10h

> I don't believe that most Swedes learn English by reading subtitles before starting school.

It's not about learning the language per se, it's about familiarizing yourself with the sound of the language, which then makes formal learning feel much more intuitive. English becomes an easy subject because you always feel a little ahead of the material. When faced with "fill in the blank" type questions, you're able to answer them by what feels right, even when you can't quite explain why it feels right.

It's why the #1 rule of language learning at any stage in life is always gonna be immersing yourself in the language you want to learn, and by far the most effective way to immerse yourself (excluding moving to another country) is to consume content in your target language.

supafastcoder
12 replies
1d17h

I think it’s a cultural difference. I’m also from a non-dubbing country (Netherlands) and I can’t stand dubbed content either. On the other hand people tell me they can’t stand subtitles because it “reveals” what they’re going to say before they say it.

lukan
7 replies
1d17h

"people tell me they can’t stand subtitles because it “reveals” what they’re going to say before they say it."

I love watching movies in the original language, but this is something I hate as well, though it is something that can be avoided.

Some movies get it right: the timing, showing just the words that are spoken, and even different colors for different speakers (very rare, I cannot even remember where I have seen it). That should be standard, but with most movies you're lucky if the subs even match the plot and do not reveal too much.

jeroenhd
2 replies
1d14h

Some of the best subtitles I've ever seen were on Tom Scott's YouTube channel. They use different colours, indicators for jokes and sarcasm, while also staying relatively close to what's actually been said. They're better than many big-budget movies and TV shows I've seen.

He talked about subtitling at some point, and I was surprised how cheap subtitling services are. I think he went beyond the price he mentioned, but it really made me question why big, profitable YouTube channels aren't spending the small change to do at least native-language subtitles that Google can translate, instead of relying on YouTube's terrible algorithm.

That said, Whisper seems to generate quite good subtitles that take short pauses for timing into account, but they're obviously never going to be as good as a human that actually understands the context of what's being said.

thylacine222
1 replies
1d11h

Whisper can also generate timings at the word level, which you could use to make better-timed subtitles

leobg
0 replies
1d10h

Yes. But Whisper's word-level timings are actually quite inaccurate out of the box. There are some Python libraries that mitigate that. I tested several of them. whisper-timestamped seems to be the best one. [0]

[0] https://github.com/linto-ai/whisper-timestamped
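For anyone who wants to try this, here is a minimal sketch of pulling word-level timings out of whisper-timestamped, based on the project's README (field names may differ in newer versions):

    import whisper_timestamped as whisper

    # Load audio and a model; larger models tend to give better word timings.
    audio = whisper.load_audio("interview.wav")
    model = whisper.load_model("small", device="cpu")

    result = whisper.transcribe(model, audio, language="en")

    # Each segment carries per-word start/end times in seconds.
    for segment in result["segments"]:
        for word in segment.get("words", []):
            print(f'{word["start"]:7.2f} -> {word["end"]:7.2f}  {word["text"]}')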

leobg
1 replies
1d9h

That's a great use case for LLMs, actually. Translate the sentence only up to what has been said so far. Basically, a balance between translating word-for-word (perfect timing, but terrible grammar) and translating the whole sentence and/or thought (perfect grammar and meaning, but potentially terrible timing).

With the SRT file format for subtitles, I think, there's no reason why one couldn't make groups of words appear as they are spoken.

Actually, I have to do the same thing when generating the dubbed voices. Otherwise it feels as though the AI voice is saying something different than the person in the video, especially when the AI finishes speaking and you still hear some of the last words from the original speaker.
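To illustrate that idea, here is a small sketch that groups word-level timings into SRT cues so text appears roughly as it is spoken. The grouping rule (a few words per cue, split on pauses) is invented for illustration, not how any particular pipeline actually does it:

    def to_srt_time(seconds: float) -> str:
        # SRT timestamps use HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    def words_to_srt(words, max_group=4, max_gap=0.6) -> str:
        """Group words (dicts with 'text', 'start', 'end') into SRT cues."""
        blocks, current = [], []
        for w in words:
            if current and (len(current) >= max_group
                            or w["start"] - current[-1]["end"] > max_gap):
                blocks.append(current)
                current = []
            current.append(w)
        if current:
            blocks.append(current)

        cues = []
        for i, block in enumerate(blocks, start=1):
            text = " ".join(w["text"] for w in block)
            start, end = to_srt_time(block[0]["start"]), to_srt_time(block[-1]["end"])
            cues.append(f"{i}\n{start} --> {end}\n{text}\n")
        return "\n".join(cues)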

postexitus
0 replies
1d8h

Unfortunately not all languages follow the same sentence structure, so translating "up to what has been said so far" is not possible.

Assume 2 dramatic stops in an English sentence, and observe the Turkish version: "I will... go to... the cinema" becomes "Ben... sinemaya... gidecegim" (I... to the cinema... go).

I am sure there are smarter examples.

crtasm
1 replies
1d16h

> different colors for different persons speaking

BBC iPlayer does this for some content, I don't know if it's ever on movies though.

masfuerte
0 replies
1d1h

It is. The iPlayer subtitles for Citizen Kane use colour to distinguish speakers.

alexdbird
1 replies
1d9h

I prefer subs over dubbing for foreign languages, but I cannot stand closed captions (for people who can't hear at all) because having my eye drawn to the bottom of the screen for a description of something I don't need to know about is horrible!

vidarh
0 replies
1d6h

Sometimes it's hilarious when they're trying to describe the dramatic tension from sounds or music, and "reveal" all the cliches, though. "Music swells to a tear-jerking crescendo"

vidarh
0 replies
1d6h

I'm Norwegian, and Norway used to be near-universally non-dubbing other than for TV for the very youngest children, and even then almost exclusively cartoons or stop motion etc. where it wasn't so jarring. But the target age of material being dubbed has crept up as it has become relatively-speaking cheaper to do compared to revenues generated in what is a tiny market.

The thing that annoys me the most about it is that it often alters the feel of the material. E.g. I watched Valiant (2005) with my son in Norwegian first, because he got it on DVD from his grandparents. He doesn't understand much Norwegian, but when he first got the DVD he was so little that it didn't matter. A few years later we watched the English language version.

It comes across as much darker in the English version. The voice acting is much more somber than the relatively cheerful way the Norwegian dub was done, and while it's still a comedy, in comparison the Norwegian version obscures a lot of the tension and makes it feel almost like a different movie.

I guess that could go both ways, but it does often feel like the people dubbing something are likely to have less time and opportunity to get direction on how to play the part, and you can often hear the consequences.

matsemann
0 replies
1d6h

I think you get used to it. Like a punchline I've read, but I don't "register" it until the proper thing happens on the screen.

ChemSpider
1 replies
1d5h

Dubbing in Germany is horrible and pervasive. Even in the news and interviews. Subtitles are cheaper and better.

As others have said, it is better to expose kids (that can read) to the original language plus subtitles.

So in other words, your solution, while technically great, is pedagogically not wise. A typical geek approach to a problem ;)

rob74
0 replies
1d5h

The worst thing about dubbing is that it's more important for the translations to have roughly the same length and correspondence to the original mouth movements than to be accurate. So the original meaning is often altered, and you don't even know it because of course you have no easy access to the original most of the time. But unfortunately Germans are so used to dubbing that subtitles don't really stand a chance. There are a few cinemas here and there that show original-language movies with subtitles, and on TV there was one experiment that I'm aware of a few years ago (on Pro Sieben Maxx) to show TV series with subtitles, but it was cancelled after some time. AFAIK it's also more expensive to secure the rights to show English-language content compared to dubbed content.

poulsbohemian
0 replies
23h7m

> I know Germany dubs most videos, but wouldn't a seven-year-old be able to read subtitles?

I gotta say... while sometimes it is a necessary evil, I would so rather not have to read subtitles. I often want to listen to a show so that I can also continue catching up on email, etc.; i.e. I can't read two things at once, but I can listen to one thing and continue working on something else.

leobg
0 replies
1d10h

Yeah, this isn't really meant to help her learn English. It's more for when we watch The Anatomy Lab, or BBC's "The Incredible Human Journey". She'll already be asking me a lot of questions about the content, so if I had to translate on top of that, it would be tedious.

Subtitles - those are actually being generated as well. I've generated SRT files during development. Color coded by speaker, and on a per-word basis, for me to get the timing right.

Basically, if you have a YouTube channel, you can take any video from your channel, run it through Speakz.ai, and you'll get 15+ additional audio tracks in different languages, plus 15+ subtitle files (SRT).

Voice cloning and speaker diarization were a bit of a challenge. On the one hand, I want to duplicate the voice that is being spoken right now. On the other hand, sometimes "right now" is just a short "Yeah" (like in the Elon interview), which doesn't give you a lot of "meat" to work with in terms of embedding the voice's characteristics.

Right now, I'm using a mix of signals:

- Is the utterance padded by pauses before/after?
- Is the utterance a complete sentence?
- Does the voice of the utterance sound significantly different from the voice of the previous utterance?

It's a deep, deep rabbit hole. I was tempted to go much deeper. But I thought I better check if anybody besides myself actually cares before I do that... :)
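For the curious, here is a rough sketch of how such signals could be combined into a speaker-change guess. The thresholds and the combination rule are purely illustrative, not the actual Speakz logic, and the voice 'embedding' is assumed to come from some speaker-verification model:

    import numpy as np

    def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def likely_speaker_change(utt, prev_utt, embed_threshold=0.35, pause_threshold=0.5) -> bool:
        """Combine weak signals into a speaker-change guess.

        `utt` / `prev_utt` are dicts with 'start', 'end', 'text' and a
        precomputed voice 'embedding'. Thresholds are made up for illustration."""
        long_pause = (utt["start"] - prev_utt["end"]) > pause_threshold
        ends_sentence = prev_utt["text"].rstrip().endswith((".", "?", "!"))
        different_voice = cosine_distance(utt["embedding"], prev_utt["embedding"]) > embed_threshold
        # A short "Yeah" gives a noisy embedding, so require agreement from
        # at least one of the other signals as well.
        return different_voice and (long_pause or ends_sentence)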

darkwater
0 replies
1d10h

A 7yo can barely keep up with subtitles in their mother tongue, depending on the speed. And that's probably true for a p90 reader. For a p50 reader there is no way they can follow subs while understanding what they say. Now, since it's a video, they might be able to interpolate from what they see, so it might be a nice challenge. But doing this with subtitles in a foreign language is only for a few privileged minds.

Source: father of an 8yo with VERY good reading skills (already reading books in 2 languages targeted at tweens)

scrollaway
12 replies
1d18h

I know this is HN so I don't want to distract from the technical achievement and how genuinely useful this can be.

I also don't want to tell you how to raise your kid. You do you, it's not my family. But I want to share how important it is to watch foreign spoken language movies and TV, especially as a kid, to be able to speak multiple languages later in life. You'll notice that in every country where TV and movies are regularly dubbed in the local language, the English levels go to shit. Dubbing is partially responsible for this because kids are not exposed to a different language on a regular basis.

I remember wanting to watch a dubbed movie with my mom as a kid, and she told me "We will watch the original instead, dubbed movies don't have a soul". It stuck with me. She was absolutely right. Today I am working on my sixth spoken language. Causation not guaranteed, merely implied.

jeroenhd
5 replies
1d14h

> We will watch the original instead, dubbed movies don't have a soul

I disagree. It's all about the quality of the voice actors and the effort put into localisation.

Having grown up on Dutch dubs of many cartoons, I honestly find the Dutch voice actor of Spongebob better than the original. I'm missing the extra energy that the Dutch VA seems to have put into the voice when I hear the original, even if the original is very good. Though text on screen isn't translated, puns and references are, sometimes overhauled completely.

The talent pool for Dutch voice actors isn't as big as I would've liked (you often hear the same five VAs in every show on a given channel), but some of them really put in the work. Many of them only do kids TV and commercials (really freaked me out to hear Ash Ketchum try to sell me soap one day) and not every VA is as good/paid enough/gets decent scripts, but there are some real gems to be found in dubs.

Last year I found out how Ukrainian dubs work and I was astounded by how weird the experience was. I'm used to dubs having only the voice track swapped out, but the Ukrainian shows seemed to just have the actors talk over the original show, like this AI tool does, and I honestly can't imagine ever getting into a show that's dubbed like that. I assume people get used to this, but I found it rather annoying.

Blanket statements like "dubs have no soul" serve nobody. There are good dubs and there are bad dubs, and the ratio will probably differ depending on the language you're talking about. Dismissing all dubs ignores the real heart and soul some dubbing teams have put into their works. That doesn't mean I disagree with the idea of exposing kids to more languages, but I wouldn't expect kids to learn much from just TV shows and movies in the first place.

jamager
2 replies
1d11h

Voices in dubbed movies don't have any depth, for instance.

That doesn't have anything to do with the quality of the voice actors. Everything sounds flat because that is just how they record it.

Dubbing is a useful convenience, an accessibility feature (even if it wasn't born that way). But they have way less soul.

jeroenhd
1 replies
1d8h

I guess we just disagree, or maybe you're used to worse dubs than I am. There's nothing inherently flat about dubbing at all. In fact, in many (older) movies and shows, actors would dub over themselves to get better audio.

jamager
0 replies
1d1h

In a movie, if a character is far away, their voice comes from far away. Voice actors always have the mic in front of them, so their voices always come from the same place, not relative to the scene. That's what I meant.

I also think it is beautiful to hear the sound of the original language, particularly if it is one I am not used to. It's part of the charm.

I have grown up with dubs, though, so I understand you. But once one gets used to no dubs, there is no way back. It's like removing sugar from the coffee.

imp0cat
1 replies
1d12h

I think the main point is that small kids get the basic building blocks for learning languages from anything they hear (even if they don't understand it yet), so listening to as many languages as possible when they are young will make learning languages easier for them later in their lives.

jeroenhd
0 replies
1d5h

I've heard this argument before, mostly from companies trying to sell language courses for kids. As far as I know it's true that kids pick up on languages much more easily when they're young, but I'm not so sure those skills will stick if all they can converse with is the TV. That's quite different from having a speaking partner such as a teacher or a bilingual parent. I suspect this is why shows like Dora the Explorer are set up like an interactive game.

I myself have been exposed to subtitled English shows and movies all my life (not every show or movie was dubbed, and there were some German shows that made it through as well) but I don't think I actually started speaking any English until I needed it to interact with strangers in Runescape, while at the same time I stopped watching any dubbed shows. Almost all of the content I consumed became English language content.

Almost passively learning a language by enveloping oneself in it works (though actual study will help you advance quicker), but you need more than TV. I can't find the actual paper I read on this once (thanks, SEO spam!) but as I recall, the biggest advantage kids have is learning pronunciation without an accent; picking up vocabulary and grammar doesn't seem to be affected much by age.

Freak_NL
1 replies
1d9h

Mostly I agree with this, but for animated works dubs can be an integral part of the product when done right, and some are even tweaked for different languages (although I strongly reject adjusting the actual cultural content for different locales). The dubs have to be made in concert with the original though. There is also a lot of plain crap out there.

But absolutely; for anything featuring live action, dubs just damage the original.

I watch a German man building his massive Lego city on Youtube (narrated and recorded quite professionally) with my five year old son for a few minutes before bed. He is now at the point where he is trying to give this weird language (to him) a place in his head. Some words are familiar (being Dutch), some are foreign, and you can see the feedback loop happening when words do land; he wants to know what that man is saying. I don't expect him to pick up any German at this point, but the basics of immersion in another language are there.

scrollaway
0 replies
23h33m

Yes, I agree with you. Actually, good-quality dubbed animated movies (= Disney) are what I often use to help learn a new language.

Baeocystin
1 replies
1d16h

I can't say I agree with dubbed movies having no soul. Greater accessibility to a wider audience is not something to deride, or hold in contempt.

That being said, I do agree that listening to other languages is a great thing. My father was a linguist, and when we would watch subtitled media, we'd play a game where we'd try and hear the cognates, pick out the most common words, figure out the basics of the grammar as we went along. It was a lot of fun!

leobg
0 replies
1d8h

One of my favorite movies was "Scent Of A Woman". But when I watched it in the German-dubbed version, I was appalled. It made the whole movie suddenly seem like a comedy. To me, the translation had killed its "soul", for lack of a better term.

I still want my kids to learn English. And ideally also one or two other languages, like Chinese.

As Nietzsche said:

"So you have mounted on horseback? And now ride briskly up to your goal? Well, my friend - but your lame foot is also riding with you!"

true_religion
0 replies
1d17h

Counterpoint: my parents didn't let me watch dubbed shows, and didn't speak our native language because they wanted me to speak unaffected English.

I can't speak any other languages, but in school my English was insanely good. To the point of perfect scores in the college scholastic exams, and when I was in uni for engineering, I took on an English major for fun with essentially no impact on my workload.

You can generalize but you can also specialize.

leobg
0 replies
1d5h

FYI, I agree with you in all points.

As I said in another comment, I wouldn't want to live in a world where everything was dubbed into my language.

Any translation takes something away from the original. And dubbing even more so.

I also believe that being exposed to a foreign language long before you ever make a conscious attempt to learn it is important. I wouldn't think I'd succeed in teaching my toddler to say "Daddy" if he hadn't been listening to the rest of us speaking for many months before.

I can see how this headline can make me seem like a buffoon of a dad. But I think I'm really not. :) When I watch The Anatomy Lab with my daughter, that's a time when I want our conversation to focus on how digestion works. Not on what the guy on the screen was saying just now. But of course there will also be times where I'll want our conversation to be about exactly that: What a foreign speaker just said. How those words come together. How they may have the same root as the words we use in German. Also, while AI has its place, I prefer to have these conversations with her myself.

waldrews
6 replies
1d17h

Impressively done! It sounds like you're:

1) doing voice recognition with voice time clues, which Whisper and the like provide, breaking it up into sentence (or similar) units; you don't need to time match individual words, but you need to time match at coarser grain.

2) using a translation engine that allows for multiple alternative translations

3) cloning the original voice, regardless of language

4) choosing the translation that has the best time match (possibly by syllable counting, or by actually rendering and timing the translations). If there isn't a close translation, maybe you're asking ChatGPT to forcibly rephrase?

5) Maybe some modest pitch-corrected rate control to pick out path that gets you closest to the timing?

Did I get any of that right?

euazOn
2 replies
1d17h

I also noticed that the third sample with Chinese sounds slightly sped up in the first English segment, so there may also be an element of postprocessing the dub (speeding it up/slowing it down).

odiroot
0 replies
1d6h

The last sample from BBC is really hilarious when translated to Polish. Something definitely went wrong and the voice speaks like a drunkard.

leobg
0 replies
1d8h

Yes. Though I don't like this solution. It breaks the flow. And it also doesn't really fully solve the problem. Overruns still accumulate if they happen too frequently. One second here, one second there... the further you get into the video, the worse it gets.

I think it would be better to either slow down the underlying video or solve the overrun issue on the translation level. A good professional dubber will find translations that will even out in terms of timing. That's something an AI should be able to do better instead of worse.

waldrews
0 replies
1d16h

Ooh and you're probably doing a split into voice and non-voice tracks of the original, and keeping non-voice at original volume, but lowering the voice track.
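If that guess is right, the separation step is the easy part these days; here is a sketch using Spleeter's two-stem model (Demucs would work similarly; this is just an illustration, not necessarily what the author uses):

    from spleeter.separator import Separator

    # Split the original audio into "vocals" and "accompaniment" stems.
    separator = Separator("spleeter:2stems")
    separator.separate_to_file("original_audio.wav", "stems/")
    # Afterwards, mix stems/original_audio/accompaniment.wav at full volume,
    # the vocals stem attenuated, and the generated dub track on top.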

leobg
0 replies
1d7h

Very good!

Yes, that's basically how it works.

I don't do any pitch-correction. But I do check the TTS output for length, and I re-generate if it doesn't match my time constraints.

I also have an arranger that tries to figure out when to play an utterance early (i.e. earlier than in the original) in order to make up for the translated version being longer.

I try to make the translations match the speaker's character, as well as the context. So ideally, Alex (Sample 2) will still say "Salut" even in German (instead of translating that greeting, too).

And I need to monitor for speaker changes. This is because I can't clone the voice unless I have a decent amount of sample data. If Elon just says "Yes", cloning the voice based on just that one syllable will make it sound like a robot. But I also can't just blindly grab any voice around it, since that might be somebody else's voice.
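A minimal sketch of the length check and re-generation described above; `synthesize` stands in for whatever TTS call is actually used, and the tolerance is an arbitrary example value:

    def fit_tts_to_slot(text, slot_seconds, synthesize, max_tries=3, tolerance=0.15):
        """Re-generate TTS until the clip roughly fits the original utterance's time slot.

        `synthesize(text)` is a placeholder assumed to return (audio, duration_seconds)."""
        best = None
        for _ in range(max_tries):
            audio, duration = synthesize(text)
            overrun = abs(duration - slot_seconds) / slot_seconds
            if overrun <= tolerance:
                return audio
            if best is None or overrun < best[0]:
                best = (overrun, audio)
            # A real pipeline could also ask the translation step for a shorter
            # (or longer) phrasing here before retrying.
        return best[1]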

crtasm
6 replies
1d16h

What potential issues with copyright are there from offering (paid) access to this tool to run on sources including Youtube, and with the output containing the source audio?

Freak_NL
3 replies
1d10h

You're creating a derived work, so you would be violating someone's IP unless the licence is permissive, and making money off it. That usually attracts the attention of whoever owns the IP.

dns_snek
2 replies
1d4h

Who's "you"? I'm obviously not a lawyer, but instinct tells me that end users are the ones creating a derived work by uploading a video they may or may not have the right to distribute. Linked website is just a tool, perhaps a cloud-based one, but still just a tool.

crtasm
1 replies
1d2h

Not just files a user uploads: "You can select videos from YouTube"

dns_snek
0 replies
1d1h

Thanks, I missed that. I can see how that would complicate things.

leobg
1 replies
1d5h

You could have asked the same question when Google started building their index. Or OpenAI trained their models.

I'm in Germany. I'm a licensed lawyer. I see the dangers.

The safest path will be to simply offer the production of multi language translations to content owners themselves. Which is also going to be more efficient - translating the thing at the source, rather than having consumers each create their own translation.

But the original intent for this has been to have my computer translate a video I want to watch with my kids in my private home. Technically, it's not "my computer" in the sense of being just the device that's physically in my home. There's stuff that happens in the cloud. Technically, copies are being made. So one could argue the point.

For most people today, getting your content seen and consumed is the highest you can achieve. To sue someone from another country who cares enough to pay someone else to translate it for him would seem bonkers. But I'm sure there are lawyers who are desperate for work. Who cannot code, and can't be bothered to learn, but still want to do something "in AI". I'll at least give them a hard time. And dare they use ChatGPT hallucinated references on me! :)

crtasm
0 replies
1d4h

I was more thinking Youtube and record labels might take issue with the service, e.g. how they go after stream ripping sites.

Having the creator put a code in their channel description to verify ownership could be a good approach. Thanks for sharing the project!

changoplatanero
5 replies
1d13h

Why doesn't YouTube have something like this built in?

Freak_NL
2 replies
1d10h

And automatically forced on every user depending on whatever their Google account is set to, just like the video titles which now get auto-translated without any way to turn this off. We're sliding back into a monolingual world.

No thanks.

leobg
1 replies
1d9h

As a native German, I also resent it when Google/Amazon/whoever tries to force a translation on me when I prefer the original language. So I wouldn't want to live in a world where everything would be dubbed into German for me. Not even if they used my tool :)

Regarding YouTube:

AFAIK, YouTube allows you to add multi-lingual voice tracks to your videos. Then, if the viewer has a preferred language set, the video will play in that language. Else in the language inferred by his browser/OS. But the user can also switch back to the original language, or any other language, right in the player.

Freak_NL
0 replies
1d9h

You can't switch to original titles though, so I'm not really confident Google is going to be offering this option for long.

madduci
1 replies
1d12h

I guess it's computationally expensive? OP states on the website that their solution takes 1 hour of processing for a 30-minute video.

Now imagine offering this for all the YouTube videos available:

- either it's done on their servers (hard to believe due to high costs)
- or it's done on the client side (which is also difficult, due to lack of processing power)

leobg
0 replies
1d9h

Well, I'm also shamefully unoptimized at the moment.

YouTube added auto-captions years ago. Long before there was Whisper, let alone things like Whisper.cpp. I imagine what I'm doing now is computationally no more expensive than what they did back then.

mannycalavera42
4 replies
1d10h

It's a lovely parenting story. Let me tell you there is also a huge opportunity for the opposite use case. My elderly parents speak only one (non-English) language. I would love to have a (cheaper) way to provide my parents with translated videos with the addition of (translated) subtitles. Subs are important because the elderly can have hearing issues. Great work, inclusivity is love.

leobg
1 replies
1d9h

I'm generating translated subtitles internally, before generating the voice-over. Also, generating those subtitles is way, way cheaper. If someone just wants the subtitles, I could offer them.

Bigger question is: What device are your parents using, and what content sources? Because I'd need to be able to download the audio, and inject the subtitles. With a regular TV, I wouldn't know how to do that.

mannycalavera42
0 replies
1d9h

An Android device (either phone or tablet) that they can then cast to a Chromecast.

The Chromecast would be a nice-to-have, not super necessary; they can put the tablet on a table close to them.

felixarba
1 replies
1d9h

I second this! I was recently looking into a way to build something like this for my grandfather, but wasn't even sure where to start from the hardware side.

I wanted to have hardware plug into TV receiver, generate subtitles for live TV program and then play it back on TV. Delay would likely be less than a minute but even a few minutes is not a problem really.

Many people with a hearing problem would benefit from this and with AI getting so good at Speech-to-text, this can be done for quite a large population.

If anyone has a recommendation on where to start with this, I'd appreciate it! I was thinking of using Whisper for subtitle generation, but I'm not sure about hardware that can take in and output HDMI and run this software.
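Not an answer to the hardware question, but as a software starting point, here is a rough sketch of chunked live transcription from an audio input (e.g. a capture device exposed as a sound card), assuming the sounddevice and openai-whisper packages; latency and real-time performance depend heavily on the machine:

    import queue
    import numpy as np
    import sounddevice as sd
    import whisper

    model = whisper.load_model("base")
    audio_q = queue.Queue()

    def callback(indata, frames, time, status):
        audio_q.put(indata.copy())

    # Transcribe roughly 5-second chunks and print them as rolling subtitles.
    with sd.InputStream(samplerate=16000, channels=1, callback=callback):
        buffer = np.zeros(0, dtype=np.float32)
        while True:
            buffer = np.concatenate([buffer, audio_q.get().flatten()])
            if len(buffer) >= 16000 * 5:
                result = model.transcribe(buffer, fp16=False)
                print(result["text"].strip())
                buffer = np.zeros(0, dtype=np.float32)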

leobg
0 replies
1d9h

I keep thinking about something similar. Also hardware. Also for my grandparents.

My grandma is 95. Her vision is bad. Even using the phone (I'm talking old school landline) is getting hit and miss, because she can't see the buttons.

Years ago, I set her up with an Echo Show. That works well enough for her to say "Call Leo". But Alexa is dumb. Sometimes, she'll mishear something and start playing music. Or start a monthly subscription... :)

So what I'd like:

- box
- screen
- far-field mic array
- AI backend

You could do a number of things with it:

- manage a grocery shopping list (AI will notice duplicates and other oddities and ask)
- communicate with the outside world (initiate calls, send emails and faxes, including to local businesses)
- optional human oversight and/or permission settings (preventing the AI, say, from ordering groceries for more than $50 a week without a family member approving the order)

Something like your "subtitle mode" could also work:

"Listen to what is currently being spoken in the room (including the TV), and display it on the screen".

My grandma has her TV running all day. So maybe one could ditch the screen and make it a "set top box". Add an IR port to it, so it can control also the TV itself. Something like that might work.

artninja1988
4 replies
1d18h

Why do you keep the original audio track on the dubbed version? I think it sounds pretty distracting, although you do want to keep sounds other than the original voice I guess

askhan
1 replies
1d18h

I think these little bits of the original sound really work well, helping us hear the original as well and keeping it less uncanny.

What an amazing project!

leobg
0 replies
1d6h

Thank you.

Exactly. I had the OG voice removed at first. But I added it back in for exactly this reason. It also serves as a tool for AI accountability: It lets you "see" that the cloned voice is indeed saying the same thing as the original voice.

That being said, it would be trivial to turn the OG voice off for anyone who wants to.

maxglute
0 replies
1d17h

I dig it, it's like amateur underground overdubs of bootleg Hollywood movies. Don't want to conflate amateur AI voice with actual personality on screen. The separation is part of the experience.

gardenhedge
0 replies
1d10h

News channels do that for interviews

Timwi
4 replies
1d10h

Would have been nice to get a download link for a program I can just run locally.

bbitmaster
3 replies
1d9h

Alas! Didn't get the memo? We're in the era where AI tools are all "software as a service," and you must pay for individual inferences from the model. How could they charge for inferences if they gave you the model to download?

leobg
1 replies
1d8h

I don't have access to any special model that I'm holding back from you in order to rent-seek.

Anyone can learn to build something like this. The parts are all available out there. There's Whisper. There's Mistral-7b. There's Tortoise, Coqui, SV2TTS. There's Python.

The bigger question is:

Would you want to?

I've been building web apps for several years now. I've sunk thousands of dollars into those projects. And literally years of my time. If I calculated my hourly wage, I'd be below a teenager mowing lawns. In Rwanda. And by a factor of 10x, probably.

The real ROI here is the learning. And that's not something I'm "taking" from anyone.

rubymamis
0 replies
1d6h

You are awesome, man! Keep it going and all the best of luck!

Timwi
0 replies
1d9h

Amen, bro.

exitb
3 replies
1d10h

I suppose you're paying a lot for the voice cloning, so do consider that in many countries voice-overs are done with a single generic voice. Would you consider a lower price point service doing just that?

I'm doing something similar and using GPT-4 for translation. What's unique about it is that you can specifically prompt it to avoid long translations by rephrasing things, so you can buy yourself some time for the "Fledermaus".
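As a concrete example of that kind of prompting, here is a sketch with the OpenAI Python client; the model name and syllable budget are illustrative, not taken from the commenter's actual setup:

    from openai import OpenAI

    client = OpenAI()  # assumes an API key in the environment

    def translate_with_length_budget(sentence: str, target_lang: str, max_syllables: int) -> str:
        """Ask the model to rephrase so the dubbed line fits the original duration."""
        system = (
            f"You translate dialogue into {target_lang} for dubbing. "
            f"Keep the meaning, but rephrase freely so the result is at most "
            f"{max_syllables} syllables. Return only the translation."
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": sentence},
            ],
        )
        return resp.choices[0].message.content.strip()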

leobg
2 replies
1d9h

Using a single speaker makes sense when you're paying for human voice talents. But if you're using a computer to generate the voice, why not generate a voice that sounds like the original speaker? Much more fun to hear Elon speak Chinese :)

exitb
1 replies
1d9h

To be blunt, because of the price. I’m running the whole pipeline of my toy project for much less than $5 per hour. If the voice cloning is the long pole in the tent, I’d just consider dropping it.

Moreover, it’s a cultural custom. I’m from Poland and here the voice-over narrator is supposed to be generic and bland, so your brain learns to tune him out and take the emotional cues from the original voice.

leobg
0 replies
1d6h

Yes, in Germany we have generic voice-over narrators, too. In documentaries, etc.. They usually match the gender of the speaker, but that's it.

Personally, I read most of my books with the iOS app Voice Dream Reader. That app still uses old TTS voices. They sounded great 3 years ago, but now sound robotic when compared to Elevenlabs or WaveNet. But, as you say, you learn to tune out the voice. I can read entire novels like this, and I still "hear" the different voices and personalities. It just all happens in my imagination.

How much I'd need to charge to make the project worthwhile depends on many factors. And I didn't want to name a price now and then backpedal a month from now and say it'll actually cost more.

My pipeline right now is super unoptimized, to the point of being embarrassing. This can all be made to run much faster and cheaper.

I agree with you that if the voice cloning part of my pipeline causes a significant chunk of the cost that the end user pays for the service, I should then offer the option of using a "bland" voice instead for a lower price.

solardev
2 replies
1d17h

This is really impressive! Can't wait to see this more fleshed out. I'd gladly pay for something like this (by the video, ideally).

Some page feedback though: It seems to me that the video just keeps playing, with no way to restart it or scrub through the timeline. Each time I click a language, it changes the spoken audio but just keeps playing where it left off. That makes it hard to compare the same passage across different languages.

Separately, I think there are also some errors in translation. For Sample 3 (about the vines), the original in Mandarin Chinese says something like "if this tree gets grabbed, the weed will climb up and wrap around it, and the tree won't be able to photosynthesize and will die". But the English mistranslation says "If it gets scared by people, it gets pulled off and messed with. It can't function. The evil effects? It just dies."

There are also timing issues where the translations don't match up with the original subtitles or dialogue, and certain parts of the original audio just seem to be altogether ignored and not translated.

Maybe displaying the translated subtitles, along with a way for users to report errors, would help...?

leobg
1 replies
1d7h

Thank you very much!

Yes. You cannot control the video playback on the demo page. I made it so because I wanted a way to showcase how you can switch between languages. You can go from Elon speaking English to German, Russian and Chinese, each with just one click. Activating the player controls would have made the UI more complex and distracting. And it would have also made it harder for me to sync the timing between languages.

Of course, the real output would be a proper player, with all of the controls. Or, for creators, raw files (video and/or audio, plus SRT subtitles).

I also noticed problems in the translation of the Chinese video. I put it up there anyway, because I figured most people coming to my site would be English speakers, and being able to understand a Chinese video might be another interesting aspect, in addition to the idea of being able to turn your own English content into languages you don't speak.

If this had been a pitch deck, I would have cherry picked the samples. But I wanted to share where the project is right now and see if anyone was interested. Premature optimization is the root of all evil in programming. I think Knuth said that. And it's a trap I regularly fall into. So I tried to be disciplined this time.

But if any Chinese YouTuber would ask me to dub their work today, I'd make darn sure that the translations were close to perfect. Meaning I'd allow the system to make changes to the way things are phrased if that's necessary for the purpose of timing or cultural context. But I wouldn't allow it to skip a thought from the original video, or say something different.

I've translated books by hand in the past. So this is something I care about. If the demo isn't perfect in this regard, it's because I didn't know if anyone was going to even look at my project. When I first posted this yesterday, my submissions didn't go beyond one comment for several hours. I already thought I had built another solution looking for a problem. :)

If you're seeing dropped phrases, that's most likely because my arranging function failed. Basically, the translation ran longer than the original. The algorithm tried to speed it up and fit it in. But it failed and dropped it. Better handling of these overruns is on my to-do list. Neither drops nor speedups should be tolerated.

In terms of self-correction, I plan to feed the translated audio back into the transcription engine. Then, an LLM can compare the translation with the original transcript. If anything is missing, the pipeline will be forced to run again with slightly different parameters. There shouldn't be a human necessary in the loop. Translation is what Transformers are best at.
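In rough Python, that self-correction loop might look something like this; `dub_pipeline`, `transcribe` and `compare_llm` are placeholders for the real components, not existing functions:

    def dub_with_backtranslation_check(segment, dub_pipeline, transcribe, compare_llm, max_retries=2):
        """Dub a segment, transcribe the result, and let an LLM flag missing thoughts.

        `compare_llm` is assumed to return a list of thoughts/details missing
        from the candidate compared to the original transcript."""
        for attempt in range(max_retries + 1):
            dubbed_audio = dub_pipeline(segment, attempt=attempt)
            back_transcript = transcribe(dubbed_audio)
            missing = compare_llm(original=segment["transcript"], candidate=back_transcript)
            if not missing:
                return {"audio": dubbed_audio, "issues": []}
        return {"audio": dubbed_audio, "issues": missing}  # flag for review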

solardev
0 replies
1d2h

Gotcha, thanks for the great walk-through and in-depth explanations! Excited to see how this thing progresses.

I'd totally pay to have something like this as a Chrome plugin for YouTube, for example.

pavelboyko
2 replies
1d6h

Please consider adding Simplified English as an output language option, preferably with a level, e.g., A2, B1, etc. This way, I can adjust the language complexity to my kids' level and then gradually remove the crutches as they improve in English.

leobg
1 replies
1d6h

Yes! I love this!

So you'd be translating English to Simplified English? Or are you talking from another source language?

I've already been playing with this concept w.r.t. books:

I take a non-fiction book. I'll have an LLM translate it with a specific audience in mind (say, a 7 year old girl with a certain background), explaining concepts and words that are likely unknown to that audience. And then converting the whole thing into an audiobook. Optional parental controls built in ("exclude violence", etc.). Nowhere near showtime, though.

Another thing I'd love to work on is filtering existing content. There are millions of videos on YouTube. Right now, finding quality stuff that's fun to watch with my kid depends a lot on dumb luck. But what if I could filter by topic (semantic whitelist/blacklist, i.e. not keyword dependent), personality traits (OCEAN, MBTI), values (e.g. "curiosity") and language (reading level, vocabulary, words per minute, etc.)? I'd love that.

vidarh
0 replies
1d5h

I'd love what they suggested as well, for other languages. I'm working on improving my French (and occasionally German), and I'm at a stage where I can follow along some French shows reasonably well if they're not speaking too fast (one of the first French phrases my French teacher in school taught us was "plus lentement, s'il vous plaît" - "slower/slowly please", for a reason), and if they're not speaking any particularly difficult accents, and not too much slang, but it's limiting and I'm often forced to keep English subtitles on as a consequence and it's sometimes too much of a crutch. It doesn't help that my hearing isn't what it was.

Being able to "step down" the difficulty so that I can either turn off subtitles entirely or rely on French subtitles, or even mix "difficult speech" with "simple subtitles" or vice versa, seems like it'd be very useful in getting over that hump faster.

maxglute
2 replies
1d17h

Very passable. Waiting for something local like this for foreign language PLEX and podcasts. As someone who views/listens to things at 2x/3x speed, 10-15 bucks an hour is cost prohibitive.

leobg
1 replies
1d5h

Perhaps you could ask some of your favorite podcast hosts to make a deal with me. Running a training on their voice once and then just re-using that will be much cheaper. Also, customers who buy in bulk will help me focus on this full time. There is huge potential for making this faster and cheaper.

(Even using OpenAI is silly. Technically, I need neither GPT-4's knowledge nor its instruction tuning. Both unnecessarily add cost and latency. But it helped me get the demo out.)

Basically, the deal for Podcasters / YouTubers would be:

- Get all their episodes converted into 15+ languages
- Increase their reach today, while the novelty is high and the market is still uncrowded
- They get to tell their sponsors that they now have reach across the language boundary

maxglute
0 replies
20h9m

I don't know the state of the podcast ecosystem, but I think you should reach out to listennotes.com, which is also a one-man job that seems to elevate discovery and looks like it has reasonable reach for producers. Or go hit up some popular Western podcasts; you've definitely got something here and the execution is good enough.

lxe
2 replies
1d18h

What does this use behind the scenes? This type of stuff can get pretty expensive if you're relying on elevenlabs or heygen.

leobg
0 replies
1d8h

Let's just say this is NOT a wrapper around Elevenlabs or Heygen. I've looked at commercial voice cloning before. But, as you said, the prices seemed ridiculous.

Before this, I made audiobooks for my daughter. Old, out-of-print books, turned into speech. If I remember correctly, with Elevenlabs a single book would have cost me > $100. At that price level, I can read the damn thing myself. What good is computer generated voice if it isn't at least 10x cheaper than doing it yourself?

I'm just one guy. With me, it's just my time, one or two commercial licenses, and other than that just the raw price of running those GPUs.

brody_hamer
0 replies
1d15h

Yea my guess would be elevenlabs, which just recently announced this exact featureset.

jeroenhd
2 replies
1d14h

I've only ever experienced Dutch dubs in kids' TV but I feel like these examples show that your Dutch model may need some work. I can't judge other languages well, but I found the Taiwanese documentary dub especially hard to follow. I wouldn't have expected Dutch to be in there for how little the language is spoken and how often Dutch speakers will understand English, though!

/offtopic It seems to do a pretty interesting thing where the first male voice has a bit of a Flemish/southern accent while the second male voice has an accent much closer to "Netherlands TV" Dutch. Reminded me a bit of the Lion King dub where the dub studio used Flemish voice actors to do the jungle animals (and Dutch voice actors for the savannah animals) to underline the "different world" Simba arrived in.

leobg
1 replies
1d7h

Yes, that issue is also present in the German translation.

I'm planning to monitor the output quality. Basically, feeding the translated audio back into the transcriber. Then compare it to the original transcript. Like a higher level loss function. I'll need this already because I don't speak all of these languages myself. But I can also use it to make the pipeline self-regulate and generate a new, better version if the last one scored too poorly.

jeroenhd
0 replies
1d6h

Interesting, I can see how that approach would catch the weird voice lines.

Just the different ways the languages get picked up and processed by the AI system could be interesting. If you find anything cool, I'd love to read a blog post about it!

daremon
2 replies
1d17h

This is really amazing! Well done.

I already joined the beta but I want to point out another use case here as well:

In many countries (ie Greece where I'm from) movies and TV shows never get dubbed. We rely on subtitles. This means that if you can't see well (disability or age-related eye problems) and if your English is not excellent, then you are doomed to only watch locally produced movies & shows.

This can be a real life-changer.

leobg
1 replies
1d7h

Thank you!

With movies, I think I could get into legally challenging territory. I guess all AI apps are, in a way. But with movies, there's an entire industry behind enforcing copyright. So I must tread carefully on that front.

I made the jump from the courtroom into VS Code years ago. I really don't want to go backwards.

daremon
0 replies
1d3h

I honestly don't see how movies are different from any other content, i.e. YouTube videos. I am pretty sure MrBeast etc. have the same lawyers as any big studio.

Could this run locally? I would certainly pay for that and you're off the hook on how anyone uses it.

auct
2 replies
2d11h

Can you add ukrainian?

leobg
1 replies
2d11h

Hey! As a target/output language? As a source language, it’s already supported.

oneshtein
0 replies
1d10h

Yes, as target language. And remove Яussian language, please, unless you are Яepublican. You can add it back after the war.

YKreator
2 replies
1d4h

Congratulations on this project! We spent 6 years developing the best solution for generating perfect subtitles automatically (https://www.Checksub.com). 2 years ago, with the arrival of new generations of AI, we decided to go a step further and add the possibility of automatically dubbing videos. But automatic dubbing requires manual adjustments for a comfortable result for the audience. For example, https://www.HeyGen.com generates a video automatically, but offers very few editing options. That's why we focus on two things:

1 - to provide the best possible automatic quality
2 - to offer an advanced editor that lets you fine-tune your dubbing without having to go back to editing software.

In any case, I'm delighted to see people working on this problem. I hope it will help develop this sector.

luxpir
0 replies
1d3h

Thanks for Checksub, another happy user here.

We take multilingual, AI-cloned audio from you guys (split from background noise, of course), after we've processed the subs professionally at our end, then we align everything in your tool and send off to a third service for lip sync. The result has blown away a few clients now. The CEO speaking 5 languages with perfect lip sync in their own tone of voice is quite convincing.

Hopefully we can get it all in one tool soon.

leobg
0 replies
1d4h

Great website! You're based in France? You should put a demo on your website. If there is one, I don't see it (or rather: hear). If you're interested in collaborating, my email is in my bio.

Gunnerhead
2 replies
1d12h

Amazing! I love this for dubbing, but was wondering if anyone knows of an AI-powered subtitle generator for YouTube videos? I know YouTube has closed captions, but they're terrible.

leobg
1 replies
1d5h

Speakz actually generates subtitles as a byproduct. The idea is that you put in a video, you select the target languages. And then you get out, for each target language, an audio track and an SRT subtitles file.

Someone else here asked about generating only subtitles, with no audio, as a cheaper option. So I'll probably add that as an option.

Gunnerhead
0 replies
1d3h

I would love that to help learn another language!

theogravity
1 replies
1d17h

Wonder how accuracy of the translation is measured (if at all).

leobg
0 replies
1d8h

One idea I have is to use back-translation. After generating the new language audio, feed it back into the transcription, and then have an LLM compare it to the transcript from the original. Penalty for any thought/detail that is missing. If too bad, start from scratch.

simple10
1 replies
1d18h

This looks amazing! Thanks for sharing. I signed up for the private beta.

leobg
0 replies
1d10h

Thank you so much!

pcchristie
1 replies
12h15m

This is extremely impressive. Congratulations!

leobg
0 replies
12h14m

Thank you!

lIIllIIllIIllII
1 replies
1d16h

This might be a game-changer for preserving declining languages.

leobg
0 replies
1d8h

I'm not sure.

Making this work is limited by the availability of reliable transcription models. Which, in turn, are limited by the availability of large training corpora. Those don't exist for rare languages.

Also, if people choose to listen to a Nepali speaker through an AI translator, that does give speakers of this language "a voice" - but it doesn't really preserve that language. You might argue, on the contrary, that it may remove any remaining incentives to learn that language.

jianshen
1 replies
1d17h

Wow this is amazing. If there was a locally running version available, I would gladly pay money for it.

leobg
0 replies
1d8h

Thanks!

Well, to make local happen I'd have to learn more about local app development.

I'd also be worried about having to support a bunch of different platforms, and being beholden to ever changing rules made by App Stores and OS makers. I actually work on a 2015 Mac with a 2019 operating system. There are many great looking AI apps that I'd love to run but can't.

Besides, it seems to me that making this centralized makes economic sense. I can just keep the GPU busy with lots of videos from many customers. I'm sure that's what most people think who build something: "The world would be so much better if everyone just came here and used this." :)

hombre_fatal
1 replies
1d3h

What a great application of AI. The samples were amazing.

leobg
0 replies
1d1h

Thank you!

hbarka
1 replies
1d15h

Wow, this is impressive. Is there anything like it for live translation?

leobg
0 replies
1d8h

Not on my to do list currently. EzDubs say they do live [0]. Also, a friend of mine mentioned some Samsung / Android app that does this?

[0] https://ezdubs.ai

godzillabrennus
1 replies
1d15h

Would love this for Plex.

leobg
0 replies
1d10h

Would love to. Can you broker a deal? :)

leobg
0 replies
1d4h

They use pyannote/speaker-diarization. I tried that, but it wasn't accurate enough for my purposes. Made a confusion matrix with voice samples from The Simpsons characters. It looked... well, confused.

Am using a mix now of speaker embeddings and other signals (end of sentence, pause before/after, etc.). As you can see in the demos, it already works well for interview situations. It's when there are 3+ speakers and they talk over each other that the system gets confused.
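For reference, a confusion-matrix-style check like the one described can be put together with an off-the-shelf speaker embedding library; here is a sketch assuming Resemblyzer (clip names are hypothetical):

    import numpy as np
    from resemblyzer import VoiceEncoder, preprocess_wav

    # A handful of labeled clips per character (hypothetical file names).
    clips = {"homer": ["homer_01.wav", "homer_02.wav"],
             "marge": ["marge_01.wav", "marge_02.wav"]}

    encoder = VoiceEncoder()
    embeds = {name: np.stack([encoder.embed_utterance(preprocess_wav(p)) for p in paths])
              for name, paths in clips.items()}

    # Mean cosine similarity between every pair of speakers (embeddings are
    # unit-normalized, so a dot product is a cosine similarity). The diagonal
    # should dominate if the embeddings separate the voices well.
    names = list(embeds)
    matrix = [[float(np.mean(embeds[a] @ embeds[b].T)) for b in names] for a in names]
    print(names)
    print(np.round(matrix, 2))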

bufferoverflow
1 replies
1d18h

I speak Russian, and I gotta say, the Lex sample is incredible. It sounds like real dubbing. Maybe not pro-level dubbing, but it's very very good. The voices are also very close to Lex's and Elon's.

Congrats! Very well done.

leobg
0 replies
1d9h

Thank you! Yeah, I also had a big grin on my face, hearing Lex and Elon suddenly talk in another language. :)

Consistency isn't perfect yet, as we're building the voice from scratch basically for each utterance. On the one hand, you want that, because the utterance might be more upbeat, or lower pitch, than the speaker's "average" voice. On the other hand, it sometimes introduces variance that makes the listener's brain go, "Uh.... is that another person speaking now?". If I had to dub 200 videos of a single YouTube channel, I would be able to fine-tune the voices of the main characters, and reserve the ad-hoc cloning for guest characters.

welder
0 replies
1d9h

I do the opposite and seek out videos in other languages for my kids to watch. Now they're learning German, Spanish, Chinese, and Japanese.

stuckkeys
0 replies
1d5h

Cool project. What is the tech stack behind it? I can already guess a few: 11labs, OpenAI… dreamtalk. There are so many similar services to what you are doing. What sets you apart? You should partner with local media or outside. Good luck!

Check out https://www.flawlessai.com; they've been around since 2018. Interesting stuff. When I first saw it in 2021 I was blown away.

sss111
0 replies
1d17h

Can you add Hindi as an output language? I've been meaning to build something like this for my parents. You saved me some work haha!

poulsbohemian
0 replies
23h4m

You deserve a lot of credit for doing this... often when I am watching German content that has English subtitles, I wonder if the subtitles are already being produced by AI... I sometimes find the subtitles more confusing (even though they are in my native language) than the German, as though they almost had to have been automatically produced rather than by an actual translator using contextual clues, etc.

poulsbohemian
0 replies
23h9m

This is interesting because I'm the opposite of you - a German speaking American, who watches a lot of German language content on YouTube. Are you specifically looking for children's content? I ask because almost anything I would watch in English, I can find an equivalent content producer in German.

pell
0 replies
20h29m

This is very impressive. I think the way you are timing the audio is clever. What kind of model are you using?

patrickhogan1
0 replies
1d15h

Awesome! A large % of foreign streams have no proper dub.

oldge
0 replies
1d19h

Very cool, any chance to see an on-device release so we can run this locally? Topaz AI has a pretty good model for this if you are looking to monetize.

ocolegro
0 replies
1d12h

So when is the crunchyroll integration rolling out?

leke
0 replies
1d12h

I'm looking for a tool that will take a foreign language and automatically generate subtitles in my language. Anyone know of such a tool?

k__
0 replies
1d3h

Would be awesome if it could also deep-fake the voice.

ipsum2
0 replies
1d7h

Does this use SeamlessM4T or similar projects?

iamjackg
0 replies
23h19m

Is this using XTTS? I recognize a funny/weird glitch with Italian voices saying "punto" (full stop) at the end of every sentence.

gagabity
0 replies
1d17h

This is great, I tried to do a similar thing once but my language is one of those that AI doesn't do well.

I think you can look into muting the original voice in the video, I remember I saw there is some AI/tech that can separate audio into voice and nonvoice.

Yandex browser does this in the browser, you open a YT video and it offers to translate, a few seconds later the voices are all in Russian, it's probably the most interesting production use of AI I have seen and for free. It's to Russian only unfortunately.

cushpush
0 replies
1d19h

This is really amazing I'm very impressed and happy you released this. Can you share more details about your development rhythm of this helpful piece of software?

azamba
0 replies
1d16h

What would be required to add a new output language? E.g Portuguese? I know it’s supported as input.

android521
0 replies
1d8h

any opensource available?

2099miles
0 replies
1d17h

Talked about this idea last month since astrobiology still isn’t all dubbed to English. Thank you for actually making the tool, it’s awesome, huge Kudos.