
WhisperSpeech – An open source text-to-speech system built by inverting Whisper

siraben
17 replies
13h50m

Interested to see how it performs for Mandarin Chinese speech synthesis, especially with prosody and emotion. The highest quality open source model I've seen so far is EmotiVoice[0], which I've made a CLI wrapper around to generate audio for flashcards.[1] For EmotiVoice, you can apparently also clone your own voice with a GPU, but I have not tested this.[2]

[0] https://github.com/netease-youdao/EmotiVoice

[1] https://github.com/siraben/emotivoice-cli

[2] https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Clon...

jpcl
6 replies
7h19m

Hi, WhisperSpeech dev here. We only support Polish and English at the moment, but we just finished some inference optimizations and are looking to add more languages.

What we seem to need is high-quality speech recordings in any language (audiobooks are great) and some recordings for each target language which can be low-quality but need varied prosody/emotions (otherwise everything we generate will sound like an audiobook).

yorwba
5 replies
4h8m

Last I checked, LibriVox had about 11 hours of Mandarin audiobooks and Common Voice has 234 validated hours of "Chinese (China)" (probably corresponding to Mandarin as spoken on the mainland paired with text in Simplified characters, but who knows) and 77 validated hours of "Chinese (Taiwan)" (probably Taiwanese Mandarin paired with Traditional characters).

Not sure whether that's enough data for you. (If you need paired text for the LibriVox audiobooks, I can provide you with versions where I "fixed" the original text to match the audiobook content e.g. when someone skipped a line.)

jpcl
3 replies
3h51m

For Polish I have around 700 hours. I suspect we will need fewer hours if we add more languages, since they do overlap to some extent.

Fixed transcripts would be nice, although we need to align them with the audio really precisely (we cut the audio into 30-second chunks and pretty much need the exact text in every chunk). It seems this can be solved with forced-alignment algorithms, but I haven't dived into that yet.
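
Roughly, the chunking step could look like this (a sketch that assumes you already have word-level timestamps from some forced aligner; the input format is made up for illustration):

    # Sketch: group force-aligned words into <=30-second chunks so every
    # chunk carries exactly the text that is spoken in it.
    # Input: list of {"word", "start", "end"} dicts (hypothetical format).

    def chunk_words(words, max_len=30.0):
        chunks, current, chunk_start = [], [], None
        for w in words:
            if chunk_start is None:
                chunk_start = w["start"]
            # start a new chunk if adding this word would exceed the window
            if w["end"] - chunk_start > max_len and current:
                chunks.append({
                    "start": chunk_start,
                    "end": current[-1]["end"],
                    "text": " ".join(x["word"] for x in current),
                })
                current, chunk_start = [], w["start"]
            current.append(w)
        if current:
            chunks.append({
                "start": chunk_start,
                "end": current[-1]["end"],
                "text": " ".join(x["word"] for x in current),
            })
        return chunks

    # words = [{"word": "你好", "start": 0.0, "end": 0.4}, ...]
    # for c in chunk_words(words):
    #     print(f'{c["start"]:.2f}-{c["end"]:.2f}: {c["text"]}')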

yorwba
2 replies
2h47m

I have forced alignments, too.

E.g. for The True Story of Ah Q: https://github.com/Yorwba/LiteratureForEyesAndEars/tree/mast... . The .align.json files are my homegrown alignment format, .srt are standard subtitles, and .txt is the text. Note that in some places I have [[original text||what it is pronounced as]] annotations to make the forced alignment work better (e.g. the "." in LibriVox.org, pronounced as 點 "diǎn" in Mandarin). Oh, and cmn-Hans is the same thing transliterated into Simplified Chinese.
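
Stripping those annotations back down to either side is a one-liner, if that helps (a sketch; the exact placement of annotations in the real files may differ from this toy example):

    import re

    ANNOT = re.compile(r"\[\[(.*?)\|\|(.*?)\]\]")

    def original_text(s):
        # keep the written form, e.g. "LibriVox.org"
        return ANNOT.sub(lambda m: m.group(1), s)

    def spoken_text(s):
        # keep the pronounced form used for forced alignment
        return ANNOT.sub(lambda m: m.group(2), s)

    print(original_text("LibriVox[[.||點]]org"))  # LibriVox.org
    print(spoken_text("LibriVox[[.||點]]org"))    # LibriVox點org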

The corresponding LibriVox URL is predictably https://librivox.org/the-true-story-of-ah-q-by-xun-lu/

jpcl
1 replies
1h13m

Thanks, I'll check it out. I don't know any Chinese so I'll probably reach out to you for some help :)

yorwba
0 replies
1h10m

Sure, feel free to email me at the address included in every commit in the repo.

freedomben
0 replies
2h32m

Librivox seems like a great source, being public domain, though the quality is highly variable.

I can recommend Elizabeth Klett as a good narrator. I've sampled her recordings of the Jane Austen books Emma, Pride and Prejudice, and Sense and Sensibility.

wferrell
5 replies
13h43m

Have you released your flashcard app?

nmfisher
1 replies
12h39m

If you're interested, I have a small side project (https://imaginanki.com) for generating Anki decks with images + speech (via SDXL/Azure).

westurner
0 replies
10h36m

Some language-learning resources, from "Show HN: Open-source tool for creating courses like Duolingo" (2023) https://news.ycombinator.com/item?id=38317345 :

ENH: Generate Anki decks with {IPA symbols, Greek letters w/ LaTeX for math and science,

knubie
1 replies
10h5m

Not OP, but I develop Mochi [0] which is a spaced repetition flash card app that has text-to-speech and a bunch of other stuff built in (transcription, dictionaries, etc.) that you might be interested in.

[0] https://mochi.cards

siraben
0 replies
3h29m

What spaced repetition algorithm does it use?

siraben
0 replies
13h41m

It's just an Anki deck.

modeless
2 replies
13h33m

Did you try XTTS v2 for Mandarin? I'm curious how it compares with EmotiVoice.

thorum
1 replies
8h52m

It has a big problem with hallucination in Chinese, random extra syllables all over the place.

modeless
0 replies
3h3m

Makes sense, I get hallucinations in English too.

colordrops
0 replies
11h14m

Just listened to the demo voices for EmotiVoice and WhisperSpeech. I think WhisperSpeech edges out EmotiVoice. EmotiVoice sounds like it was trained on English spoken by non-native speakers.

samstave
13 replies
13h29m

Aside: is it just me, or is anyone else just as dumbfounded by how quickly literally every aspect of AI and LLMs and models and blah blah blah is moving?

Am I weird in just having my head spin? I've been at leading-edge tech before, but this feels like me yelling at these new algos on my lawn.

jazzyjackson
10 replies
13h14m

On the contrary, I'm really disappointed in how long it's taking anything to get into production.

Whisper and self-hostable LLMs had a Cambrian explosion about a year ago. I attended a GPT-4 hackathon last March and in 48 hours saw people hook up Speech2Text -> LLM -> Text2Speech pipelines for their live demos. I thought we would all have babelfish by June.

Months later I attended some conferences with international speakers that really wanted to have live, translated-on-the-fly captions, but there wasn't anything off the shelf they could use. I found a helpful repo for using Whisper with rolling transcription but struggled to get the Python prerequisites installed (it involved hardlinking to a TensorFlow repo for my particular version of M1 CPU). It was humbling and also hype-busting to realize that it takes time to productize, and that the LLMs are not magic that can write these applications themselves.

In the meantime, even Google hasn't bothered to run the improved transcription models on YouTube videos. They're still the old, roughly 80%-accurate tech that's useless on anyone with an accent.

ricketycricket
1 replies
12h2m

I built this last March. It captures audio from a live HLS stream and transcribes and translates into 18 languages on the fly. Used by a customer with about 25K international employees for their internal events. Works surprisingly well.

jazzyjackson
0 replies
9h34m

Fabulous, guess that's the other part of productizing: a paying customer!

huijzer
1 replies
12h44m

On the contrary, I'm really disappointed in how long it's taking anything to get into production.

I agree. I was thinking about making a Jarvis-like bot, which should be pretty easy at this point. The main problem was that my iPhone doesn't easily allow pressing a button upon which it starts listening: you always need to unlock first, at which point the whole screen gets unlocked too. Maybe these kinds of GUI-focused interfaces are blocking a lot of ideas? At the same time, it's great that people will come up with new devices and these will compete somewhat with phones.

Havoc
0 replies
6h38m

The tap-on-back gesture might work without unlocking, and I think it can be set to a custom shortcut.

Rodeoclash
1 replies
12h55m

I'd be interested if you ever dig anything up for this. I hacked together a kind of crude tool to snapshot audio and translate / caption it on the fly:

https://captioner.richardson.co.nz/

I would very much like to improve on this, but live translation/captioning in this space still has some way to go.

Source was here: https://github.com/Rodeoclash/captioner

follower
0 replies
11h31m

I was going to suggest considering looking into vosk but... clearly that suggestion isn't very useful to you. :)

vineet202
0 replies
6h39m

Check out WhisperLive: https://github.com/collabora/WhisperLive

If you're grappling with the slow march from cool tech demos to real-world language model apps, you might wanna check out WhisperLive. It's a rad open-source project that's all about leveraging Whisper models for slick live transcription: think real-time, on-the-fly translated captions for those global meetups. It's a neat example of practical, user-focused tech in action. Dive into the details on their GitHub page.

taneq
0 replies
12h1m

I'm really disappointed in how long its taking anything to get into production.

It was humbling and also hype-busting to realize that it takes time to productize

Yep, looks like you found out why it’s taking so long to get this new tech into production. The gap between nothing and a proof of concept is, in some ways, much smaller than the gap between proof of concept and commercial product.

samstave
0 replies
2h45m

You're focused on whisper/voice stuff...

I was making a more general statement... I haven't even had time to personally look at any voice stuff...

Too many Shiny Things and too much ADHD in the Kool-Aid.

pksebben
0 replies
1h53m

I have a similar frustration with the lack of tooling around all this stuff.

Like, you had the time to train a bajillion parameter model with a ton of attendant code, but an installation script was a bridge too far. I get that python dependency management sucks, but you had to do it at least once for yourself.

Of course, here I am reinstalling cuDNN for the umpteenth time because this software is provided free of charge and it sprinkles magical fairy dust on my GPU, so perhaps I shouldn't whine about it.

colechristensen
1 replies
13h20m

This is the structure of revolutions, particularly of this kind. Exponential growth looks like this.

In particular, the generation/recognition abilities of ML models have this feature of being a curiosity but not quite useful... if a speech recognition program goes from 50% to 75% accuracy, it's a huge accomplishment, but the program is still approximately as useless as before. Going from 98% to 99% accuracy, on the other hand, also cuts the errors in half, but now it's super impressive: something that was useful but made mistakes now makes half as many. Once you hit the threshold of minimum usefulness, the exponential growth seems sudden and amazing when it's actually been going on for a long time.

At the same time, we've had a few great improvements in methodology for how models are designed (like transformers); the first iterations showed how impressive things could be but were full of inefficiencies, and we're watching those go away rather quickly.

lioeters
0 replies
3h49m

structure of revolutions

For anyone who hasn't heard of it, this phrase is a reference to the theory of paradigm shifts in scientific progress, introduced in the book "The Structure of Scientific Revolutions" by Thomas Kuhn.

https://en.wikipedia.org/wiki/The_Structure_of_Scientific_Re...

huytersd
11 replies
13h7m

What’s the text to speech generator that chatGPT uses? It’s the most impressive one I’ve heard so far.

miki123211
7 replies
11h50m

If you think OpenAI's TTS is impressive, you should check out Eleven Labs. They have the highest-quality models IMO. Voice quality, emotional awareness/inflection, and support for foreign languages are top-notch; it's that last point that OpenAI seems to have the most issues with. If you find a good voice to clone, the latest models can even replicate somewhat unusual accents and speaking styles.

For plain old English TTS with a stock voice, there isn't that much of a difference (although Eleven Labs still wins IMO), but if you need either voice cloning or foreign language support, nothing else comes even close.

With that said, Eleven is extremely pricey; something like Azure TTS (which is the best among the cheap options) may be a better fit for less demanding applications.

reissbaker
5 replies
10h40m

The quality difference between Eleven and OpenAI is IMO pretty small, but the price difference is enormous: for 50,000 characters (approx 1hr of audio, by Eleven's estimates), you'd pay Eleven Labs $9 assuming you're in their highest $330/month payment commitment tier; for OpenAI there's no minimum commitment and the same number of characters would cost $0.75.

If you're generating speech once and replaying it many times (e.g. making podcasts), the difference is negligible and you might as well go with Eleven Labs, since it's more customizable and possibly slightly higher quality. If you're doing interactive speech with customers, $9/hr is incredibly expensive (higher than hiring a minimum-wage worker in the U.S.!), and OpenAI's TTS is a very close second best and much more reasonably priced. If you're trying to integrate speech into an AI product, Eleven makes your hourly costs pretty unfeasible since you have to at minimum charge your customers more than it costs to hire a human being to do a task.

Azure's "Neural" line of TTS is the best of the big cloud offerings, but it's pretty mediocre compared to either OpenAI or Eleven Labs IMO. And it's actually more expensive than using OpenAI: it's $0.80 for 50,000 characters (~1hr), unless you're willing to commit to over $1k monthly spend, at which point it's barely cheaper than OpenAI at $0.64 per 50k characters.

OpenAI's TTS is IMO the best option for anything interactive, since it's so much higher quality than Azure's Neural TTS and so much cheaper (with very little quality difference) as compared to Eleven Labs.
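
To put the gap in numbers, a back-of-the-envelope sketch using the prices quoted above and Eleven's estimate of roughly 50,000 characters per hour of audio (the monthly volume is just an example):

    # Rough per-hour TTS cost, using the figures quoted above
    # (50,000 characters taken as roughly one hour of audio).
    price_per_50k_chars = {
        "Eleven Labs (highest tier)": 9.00,
        "OpenAI TTS": 0.75,
        "Azure Neural (pay-as-you-go)": 0.80,
        "Azure Neural (>$1k/mo commitment)": 0.64,
    }

    hours_per_month = 1000  # example volume for an interactive product
    for name, usd_per_hour in price_per_50k_chars.items():
        monthly = usd_per_hour * hours_per_month
        print(f"{name}: ${usd_per_hour:.2f}/hr -> ${monthly:,.0f}/month at {hours_per_month} hrs")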

BeetleB
4 replies
2h37m

For anyone reading, in case you want a whole order of magnitude cheaper, just go with Google Cloud TTS. For many voices, you get 1 million characters free per month, and even beyond that it's ridiculously cheap. Some voices do sound artificial, but many sound quite human - the only tells are the relatively consistent tone and section ends (no appropriate pauses).

I don't read long articles any more. I have a script that extracts the text, does TTS via Google Cloud, and adds it to my podcast so I can listen to it while driving. Been doing this for months and haven't paid a cent.

stavros
1 replies
1h52m

That's a good suggestion, thank you. Would it be possible to post some code? I've found GCP's APIs/documentation to be a bit abstruse.

BeetleB
0 replies
1h5m

Actually, all the code related to this was copied from the docs, almost verbatim.

(But yes, their docs suck).
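
For reference, the core call is basically Google's own quickstart snippet (a sketch; the voice choice and the article-extraction/podcast plumbing around it are up to you, and long articles need to be split into multiple requests because the API caps request size):

    from google.cloud import texttospeech  # pip install google-cloud-texttospeech

    client = texttospeech.TextToSpeechClient()

    def synthesize(text: str, out_path: str = "article.mp3") -> None:
        # Standard synthesis call, essentially as shown in Google's quickstart.
        response = client.synthesize_speech(
            input=texttospeech.SynthesisInput(text=text),
            voice=texttospeech.VoiceSelectionParams(
                language_code="en-US",
                ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
            ),
            audio_config=texttospeech.AudioConfig(
                audio_encoding=texttospeech.AudioEncoding.MP3
            ),
        )
        with open(out_path, "wb") as f:
            f.write(response.audio_content)

    synthesize("Hello from Cloud Text-to-Speech.")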

ametrau
1 replies
1h27m

None of their available voices are as good as ms

BeetleB
0 replies
1h6m

ms?

huytersd
0 replies
11h37m

Maybe I’m not a good judge but OpenAI’s voices sound very natural to me and seem better than Eleven labs.

GaggiX
1 replies
13h0m

They use their own models, and we don't know anything about their architecture (I believe), but you can use them with the OpenAI API.

huac
0 replies
2h40m

you could make an informed guess by looking at what the highest quality open source model is, looking at the current employer of that model's creator, and what they currently work on there

etguy
0 replies
13h2m

OpenAI’s own models. They’re also available commercially via API and are pretty affordable.

nmfisher
9 replies
12h22m

I've been following jpc [0] on the LAION discord since he started building this last year, and it's a very impressive project.

The key here is that the Whisper multilingual ASR model has been trained on a huge amount of data, so its encoder output is a very good representation of the semantic content of speech. This can be used as an open-source, drop-in replacement for the semantic encoder in model architectures like SPEAR-TTS/VALL-E/etc. (whose semantic encoders are not publicly available). That representation is then used to predict acoustic tokens (the output of the quantized, low-bandwidth EnCodec audio codec), which are then upsampled/denoised/enhanced with the Vocos vocoder.
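
To make the first part concrete, pulling the encoder representation out of the open-source Whisper checkpoint only takes a few lines (a sketch with the openai-whisper package; WhisperSpeech additionally quantizes these embeddings into discrete semantic tokens, which isn't shown here):

    import torch
    import whisper  # pip install openai-whisper

    model = whisper.load_model("base")

    # Load and pad/trim audio to Whisper's 30-second window, then compute
    # the log-mel spectrogram the encoder expects.
    audio = whisper.pad_or_trim(whisper.load_audio("clip.wav"))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    with torch.no_grad():
        # Encoder output: a (1, 1500, d_model) tensor summarizing the speech
        # content, i.e. the representation the semantic tokens are built from.
        embeddings = model.encoder(mel.unsqueeze(0))

    print(embeddings.shape)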

I know someone is working on Hindi but it would be great to see this extended to other languages for a properly open-source [1], multilingual TTS platform. I think the main bottleneck at the moment is finding people who can procure/clean compliant datasets.

[0] https://github.com/jpc

[1] jpc/Collabora went to great efforts to ensure that they are only using properly licensed data to train this. I doubt Whisper itself was that compliant, so it's a bit muddy.

jpcl
4 replies
6h49m

Yeah, Whisper is not clear-cut, but since it is not a generative model I think their data usage is a lot more likely to be considered fair use.

And the part of that which we use for WhisperSpeech is just the phonetic representation so our model is not able to recreate any of the Whisper training data in any way.

leereeves
3 replies
3h30m

The readme says "We are working only with properly licensed speech recordings and all the code is Open Source so the model will be always safe to use for commercial applications."

Is that less certain than the quote implies?

doctorpangloss
1 replies
1h4m

Laypeople value the aesthetics of statements like these. It's very Discord energy.

Everyone using learned weights from other models, especially ones released by OpenAI, Stability and Google, such as text and audio encoders, is tainted by training materials that were not expressly licensed for the purpose of AI model training or unlimitedly licensed for any use.

jpcl
0 replies
42m

That's true but you make it sound like it's totally obvious where the line of fair use should be drawn for AI training.

Until courts or lawmakers make it clearer, I personally believe non-generative models (Whisper, ResNet, DINOv2) should be legally trainable on publicly released data. Generative models (image or video generation, TTS, LLMs?) should be held to much higher scrutiny since their outputs can potentially compete with the creators who put a lot of creativity into their art. That's not true for an ImageNet-trained classification model or Whisper ASR.

jpcl
0 replies
1h18m

We are working hard to uphold all the licensing rules but nobody can absolve you from all legal risks.

There may be a court ruling or new law that any training requires special permission from the original author, and then even a CC-BY license won't cover this.

3abiton
3 replies
11h15m

They don't mention the ability to add custom voices to the speech output. I wonder if that's a feature that would be supported.

atwrk
2 replies
10h39m

They do mention voice cloning in the README ("We’ve also added an example of voice cloning based on a reference audio file."), do you have something different in mind?

addandsubtract
1 replies
7h17m

It's only found in the Colab and not in the Readme, though. The examples in the Colab are also better than the ones found in the Readme. Maybe the Readme still needs to be updated?

jpcl
0 replies
6h57m

Hi, thanks a lot for the tip, I'll update the README samples ASAP. :)

I was busy working on inference performance in the last few weeks and totally did not expect to land on Hacker News today. I only noticed it an hour ago because my GitHub stars jumped quite a bit.

lolinder
8 replies
13h55m

I know it's old at this point and doesn't use the fancy new tech, but Mycroft's Mimic 3 is still pretty impressive and is small enough to fit comfortably and generate speech in real time on a Raspberry Pi [0]. Some of their voices are better than others, but the best of them are definitely equal to the examples of WhisperSpeech given here.

[0] https://mycroft.ai/mimic-3/

jpcl
2 replies
7h4m

Yeah, Mimic is a lot less resource-intensive. We are working to improve WhisperSpeech in this regard, but it's probably always going to require more compute (in return you'll get higher quality).

That said, if you have a modern Nvidia GPU you should be able to run a voice bot in real time with WhisperSpeech.

freedomben
1 replies
2h28m

Will something like whisper.cpp be possible for WhisperSpeech?

jpcl
0 replies
48m

We looked at this at one point and it seems whisper.cpp/llama.cpp have all the bits needed to make it work. I'd love to help if someone wanted to give it a shot.

follower
2 replies
12h1m

If you're not already aware, the primary developer of Mimic 3 (and its non-Mimic predecessor Larynx) continued TTS-related development with Larynx and the renamed project Piper: https://github.com/rhasspy/piper

Last year Piper development was supported by Nabu Casa for their "Year of Voice" project for Home Assistant and it sounds like Mike Hansen is going to continue on it with their support this year.

lolinder
0 replies
2h59m

Wow, I did not know that! Thank you!

IshKebab
0 replies
5h23m

That is much better quality than Mimic (which didn't sound very close to WhisperSpeech to me).

boxed
1 replies
11h14m

The "English US" voice sounds more Scottish than American to me :P

rcarmo
0 replies
10h32m

Which might be a good thing, nae, laddie?

odiroot
5 replies
8h0m

The Polish sample is really good. Sounds like an audiobook recording.

jpcl
4 replies
7h0m

Both the Polish and English samples are actually synthesized with a voice trained on the WolneLektury audiobooks. They are the highest-quality open-source (CC BY-SA) audiobooks I could find.

By using the Whisper-derived phonetic representation (so-called semantic tokens), we successfully trained a model on a high-quality speech dataset in just one language, and the voice quality transferred to English.

satvikpendem
1 replies
5h57m

How much compute does it require to train from scratch? I'm wondering because I have a lot of audiobooks; they're not necessarily CC-licensed, but for my private usage and training I think it'd be fine.

jpcl
0 replies
5h13m

Training the T2S model from scratch takes around 8 hours on 96 A100 GPUs. Training the `tiny` S2A model is around 3x faster (training the HQ `small` variant is comparable to T2S).

I think you would get good results with fine-tuning but unfortunately we don't have a user-friendly notebook or script to do that right now. The biggest model is 800MB (FP32) so you won't even need a very big GPU to be able to fine-tune.

e12e
1 replies
2h28m

Link to these in English? I found some hits that may be correct for Polish - but I'm guessing they're hosted somewhere canonical?

jpcl
0 replies
1h11m

https://wolnelektury.pl/katalog/audiobooki/ is the Polish audiobook collection.

The English audiobooks are public domain recordings from LibriVox (via the LibriLight dataset).

nickmcc
4 replies
15h54m

I was looking at a video on training a custom voice with Piper, following a tutorial at https://www.youtube.com/watch?v=b_we_jma220, and noticed how the datasets required text metadata for the source audio files. This training method by Collabora seems to automate that process and only requires an audio file for training.

jpcl
2 replies
7h7m

Yup, we are using Whisper to transcribe automatically so we can train the model on just speech recordings, without human transcripts.

This works for any language that is well supported by the OpenAI Whisper model.
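
The transcription step itself is just standard Whisper usage, roughly like this (a sketch; the model size and language are whatever fits your data):

    import whisper  # pip install openai-whisper

    model = whisper.load_model("medium")

    # Transcribe a raw recording; no human transcript needed. Each segment
    # comes back with start/end timestamps, which can then be paired with
    # the corresponding audio chunks for training.
    result = model.transcribe("audiobook_chapter.mp3", language="pl")

    for seg in result["segments"]:
        print(f'{seg["start"]:7.2f} {seg["end"]:7.2f} {seg["text"]}')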

deskamess
1 replies
6h41m

Where can we find the latest OpenAI language model rankings?

jpcl
0 replies
5h10m

There is a plot of language performance on their repo: https://github.com/openai/whisper

I am not aware of a multi-lingual leaderboard for speech recognition models.

gmerc
0 replies
13h51m

Whisper solves it; that's its purpose.

londons_explore
4 replies
2h33m

The first demo on that page was trained from a crappy 32kbps recording of Winston Churchill...?

Garbage in, garbage out?

stavros
3 replies
1h55m

Must have been, it sounds very much like the quality of the "we shall fight on the beaches" speech.

A bit of an unfortunate choice for a demo, sadly.

jpcl
2 replies
11m

Good point, thanks. And here I was thinking it would show that the model can really synthesize very varied samples... ;)

stavros
1 replies
7m

It does, but maybe put it last!

jpcl
0 replies
4m

Did exactly that, thanks for spotting that. :)

https://github.com/collabora/WhisperSpeech/commit/398b889060...

dale_glass
4 replies
8h39m

How tunable is the voice?

I'm interested in applying TTS to a chat system, and one important feature for that is having as many distinct voices as possible, so that each person can have their own.

Would this, or something else be able to do that?

jpcl
2 replies
7h16m

We support voice cloning so you can mimic the sound of any real voice (or try to create random ones). The prosody/emotions are more difficult to control right now but we are looking into this.

To see how this works in practice, check the Google Colab link; at the end we clone the voice from a Churchill speech recorded over the radio.

dale_glass
1 replies
6h28m

Sounds excellent! What are the requirements to run this regarding hardware? How much VRAM? Does it work on AMD or Intel Arc?

jpcl
0 replies
3h59m

Both models use around 3GB right now (converted to FP16 for speed). But I checked that the (slower) FP32 version uses 2.3GB, so we are probably doing something suboptimal here.

We support CUDA right now although it should not be too hard to port it to whisper/llama.cpp or Apple's MLX. It's a pretty straightforward transformer architecture.

genpfault
0 replies
3h15m

applying TTS to a chat system

John Madden![1]

[1]: https://knowyourmeme.com/memes/moonbase-alpha-text-to-speech

WhackyIdeas
3 replies
12h13m

Can it run local only?

jpcl
2 replies
7h14m

Yes, on a consumer 4090 card it's 12x faster than real-time. We'll benchmark some older cards as well for comparison.

I think it should work pretty well with Apple's MLX framework too, if anyone is willing to convert it. :)

WhackyIdeas
1 replies
4h20m

And totally private, as in no internet needed?

jpcl
0 replies
3h55m

Yes, you download the weights once from Huggingface and you can do whatever you want with it. :) We have no cloud APIs or usage tracking of any kind.

zerop
2 replies
14h1m

Can this be run on Mac M1?

abathur
1 replies
13h20m

Idk if it would run out of the box, but it should be possible. I know that Whisper (and some variants) runs on both x86 and Apple silicon Macs.

jpcl
0 replies
7h11m

It should run with PyTorch, but the performance might not be great. As a long-time Mac user myself, I would love it if someone sent a PR to port it to MLX :)

jpcl
2 replies
6h51m

Hi, WhisperSpeech dev here.

Thanks for all the nice comments. I've been working really hard on this model for quite a few months now, but there are still a lot of ways we can make it better.

Thanks to the generosity of Collabora this is a real open-source project (not just a one-time marketing ploy), so if you want to help improve it or integrate it into something you are building, I'd love to help.

You can also buy our undivided engineering attention if you have a business use case. :)

dale_glass
1 replies
3h46m

Thanks to the generosity of Collabora this is a real open-source project (not just a one-time marketing ploy), so if you want to help improve it or integrate it into something you are building, I'd love to help.

We're probably interested! We're Overte, an open source VR/Desktop social platform.

The system targets VR and voice chat primarily, but we want to be more accessible to people who can't use voice chat for any reason. We do have an integrated chat, but it's not an ideal experience in VR. So good TTS to make it integrate better would be great for us. And the possibility of doing this without some sort of commercial API that requires keeping a secret API key is huge.

So yeah, we're very much interested in giving this one a try. It will probably take some time as we're gearing up for FOSDEM now, though.

jpcl
0 replies
1h15m

Yeah, we'd love to help you when you decide to give it a try so feel free to reach out. We also have quite a few people working on VR at Collabora.

globalnode
2 replies
8h48m

This is the best TTS I've heard; the voice modulates as you'd expect a human to.

pksebben
0 replies
1h57m

Not to step on any toes here (I've starred WhisperSpeech b/c it really is amazing and I intend to use it), but you should also check out Tortoise [1]. IMO the quality is a little better (for now) but it is painfully slow; even with KV caching it doesn't quite get up to real time on my 4090 except with very short snippets.

[1] https://github.com/neonbjb/tortoise-tts

jpcl
0 replies
7h9m

Thanks a lot. :)

We are constantly working on these models and we push new versions every two months or so. It should get even better soon. :)

rhdunn
1 replies
42m

Is there any work/progress on an NN/trained model based on International Phonetic Alphabet (IPA) transcriptions? I.e. to be able to create an IPA transcription and convert it back to sound.

That approach would be useful for things like shifting a voice to a different accent and to support voices speaking multiple languages.

This can be done to a limited extent for models such as MBROLA voices by mapping the phonemes of one language to the phonemes of the MBROLA voice. MBROLA is more complex in that it supports diphones, and many diphone pairs don't exist, so you need to map 3 phonemes together to get the best matching phonetic transcription.

The IPA approach may also make it easier to train the phonetic synthesis, given that the IPA vowels lie on a formant continuum (similar to colour wheels and cubes). Then the model could better learn the variations in voice quality and timbre.
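
For the text-to-IPA half of that, the espeak-ng backend of the phonemizer package already works today (a sketch; it doesn't solve the IPA-to-audio direction, which is the part that would need the kind of trained model described above):

    from phonemizer import phonemize  # pip install phonemizer (requires espeak-ng)

    # Text -> IPA via the espeak-ng backend; the reverse direction
    # (IPA -> audio with a learned model) is the open problem here.
    ipa = phonemize(
        "An open source text to speech system",
        language="en-us",
        backend="espeak",
        strip=True,
    )
    print(ipa)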

jpcl
0 replies
13m

That's an interesting thought. The semantic tokens we get from Whisper serve a similar purpose – you can convert existing speech to different voices; I haven't tried accents yet.

There is still a lot to explore in this space – we certainly don't have all the answers yet!

RockRobotRock
1 replies
10h40m

holy shit what. how?

Havoc
0 replies
6h36m

Same way the algos designed to spot cats in pictures can generate pictures of cats in reverse. Sorta

gbajson
0 replies
2h7m

Would any of you know if it is trained to recognize geographical places, famous people, etc.?