Interested to see how it performs for Mandarin Chinese speech synthesis, especially with prosody and emotion. The highest quality open source model I've seen so far is EmotiVoice[0], which I've made a CLI wrapper around to generate audio for flashcards.[1] For EmotiVoice, you can apparently also clone your own voice with a GPU, but I have not tested this.[2]
[0] https://github.com/netease-youdao/EmotiVoice
[1] https://github.com/siraben/emotivoice-cli
[2] https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Clon...
Hi, WhisperSpeech dev here. We only support Polish and English at the moment, but we just finished some inference optimizations and are looking to add more languages.
What we seem to need is high-quality speech recordings in any language (audiobooks are great), plus some recordings for each target language that can be low-quality but need varied prosody/emotion (otherwise everything we generate will sound like an audiobook).
Last I checked, LibriVox had about 11 hours of Mandarin audiobooks and Common Voice has 234 validated hours of "Chinese (China)" (probably corresponding to Mandarin as spoken on the mainland paired with text in Simplified characters, but who knows) and 77 validated hours of "Chinese (Taiwan)" (probably Taiwanese Mandarin paired with Traditional characters).
Not sure whether that's enough data for you. (If you need paired text for the LibriVox audiobooks, I can provide you with versions where I "fixed" the original text to match the audiobook content e.g. when someone skipped a line.)
For Polish I have around 700 hours. I suspect we will need fewer hours per language once we add more languages, since they do overlap to some extent.
Fixed transcripts would be nice, although we need to align them with the audio really precisely: we cut the audio into 30-second chunks and pretty much need the exact text for every chunk. It seems this can be solved with forced-alignment algorithms, but I haven't dug into that yet.
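The chunking step described above could be sketched like this, assuming word-level timestamps from some forced aligner (the `(word, start, end)` input format is my assumption for illustration, not WhisperSpeech's actual pipeline):

```python
# Sketch: group word-level forced-alignment output into <=30 s chunks,
# so each audio chunk carries exactly the text spoken inside it.

def chunk_alignment(words, max_len=30.0):
    """words: list of (text, start_sec, end_sec), sorted by start time.
    Returns a list of (chunk_text, chunk_start, chunk_end)."""
    chunks = []
    cur_words, cur_start, prev_end = [], None, None
    for text, start, end in words:
        if cur_start is None:
            cur_start = start
        if end - cur_start > max_len and cur_words:
            # close the current chunk before this word would overflow it
            chunks.append((" ".join(cur_words), cur_start, prev_end))
            cur_words, cur_start = [], start
        cur_words.append(text)
        prev_end = end
    if cur_words:
        chunks.append((" ".join(cur_words), cur_start, prev_end))
    return chunks
```

You'd then slice the audio at each chunk's (start, end) and pair it with the chunk text; words longer than `max_len` on their own would still need special handling.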
I have forced alignments, too.
E.g. for The True Story of Ah Q: https://github.com/Yorwba/LiteratureForEyesAndEars/tree/mast... — .align.json is my homegrown alignment format, .srt are standard subtitles, and .txt is the text. Note that in some places I have [[original text||what it is pronounced as]] annotations to make the forced alignment work better (e.g. the "." in LibriVox.org, pronounced as 點 "diǎn" in Mandarin). Oh, and cmn-Hans is the same thing transliterated into Simplified Chinese.
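If it helps anyone reading along, annotations in that [[original||pronounced-as]] style can be split into two parallel texts with a small regex — one for display, one to feed the aligner. This is my own sketch of the syntax as described above, not the repo's actual tooling, and the sample string is made up:

```python
import re

# Matches [[original||pronounced-as]] annotations (non-greedy, so
# multiple annotations on one line are handled independently).
ANNOT = re.compile(r"\[\[(.*?)\|\|(.*?)\]\]")

def display_text(s):
    """Keep the original written form (group 1)."""
    return ANNOT.sub(lambda m: m.group(1), s)

def aligner_text(s):
    """Keep the form as actually pronounced (group 2)."""
    return ANNOT.sub(lambda m: m.group(2), s)
```

So a string like `"LibriVox[[.||點]]org"` would display as `LibriVox.org` while the aligner sees the 點 that the narrator actually says.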
The corresponding LibriVox URL is predictably https://librivox.org/the-true-story-of-ah-q-by-xun-lu/
Thanks, I'll check it out. I don't know any Chinese so I'll probably reach out to you for some help :)
Sure, feel free to email me at the address included in every commit in the repo.
LibriVox seems like a great source, being public domain, though the quality is highly variable.
I can recommend Elizabeth Klett as a good narrator. I've sampled her recordings of the Jane Austen books Emma, Pride and Prejudice, and Sense and Sensibility.
Have you released your flashcard app?
If you're interested, I have a small side project (https://imaginanki.com) for generating Anki decks with images + speech (via SDXL/Azure).
Some language-learning resources, from "Show HN: Open-source tool for creating courses like Duolingo" (2023): https://news.ycombinator.com/item?id=38317345
Not OP, but I develop Mochi [0] which is a spaced repetition flash card app that has text-to-speech and a bunch of other stuff built in (transcription, dictionaries, etc.) that you might be interested in.
[0] https://mochi.cards
What spaced repetition algorithm does it use?
It's just an Anki deck.
Did you try XTTS v2 for Mandarin? I'm curious how it compares with EmotiVoice.
It has a big problem with hallucination in Chinese, random extra syllables all over the place.
Makes sense, I get hallucinations in English too.
Just listened to the demo voices for EmotiVoice and WhisperSpeech. I think WhisperSpeech edges out EmotiVoice. EmotiVoice sounds like it was trained on English spoken by non-native speakers.