Interested to see how it performs for Mandarin Chinese speech synthesis, especially with prosody and emotion. The highest quality open source model I've seen so far is EmotiVoice[0], which I've made a CLI wrapper around to generate audio for flashcards.[1] For EmotiVoice, you can apparently also clone your own voice with a GPU, but I have not tested this.[2]
[0] https://github.com/netease-youdao/EmotiVoice
[1] https://github.com/siraben/emotivoice-cli
[2] https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Clon...
Hi, WhisperSpeech dev here. We only support Polish and English at the moment, but we just finished some inference optimizations and are looking to add more languages.
What we seem to need is high-quality speech recordings in any language (audiobooks are great), plus some recordings for each target language that can be low-quality but need varied prosody/emotion (otherwise everything we generate will sound like an audiobook).
Last I checked, LibriVox had about 11 hours of Mandarin audiobooks and Common Voice has 234 validated hours of "Chinese (China)" (probably corresponding to Mandarin as spoken on the mainland paired with text in Simplified characters, but who knows) and 77 validated hours of "Chinese (Taiwan)" (probably Taiwanese Mandarin paired with Traditional characters).
Not sure whether that's enough data for you. (If you need paired text for the LibriVox audiobooks, I can provide you with versions where I "fixed" the original text to match the audiobook content e.g. when someone skipped a line.)
For Polish I have around 700 hours. I suspect we will need fewer hours per language once we add more languages, since they do overlap to some extent.
Fixed transcripts would be nice, although we need to align them with the audio really precisely: we cut the audio into 30-second chunks and pretty much need the exact text for every chunk. It seems this can be solved with forced-alignment algorithms, but I haven't dug into that yet.
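The chunking step described above could be sketched like this, assuming word-level timestamps from some forced aligner (the `(word, start, end)` input format is my assumption for illustration, not WhisperSpeech's actual pipeline):

```python
# Sketch: group word-level forced-alignment output into <=30 s chunks,
# so each audio chunk carries exactly the text spoken inside it.

def chunk_alignment(words, max_len=30.0):
    """words: list of (text, start_sec, end_sec), sorted by start time.
    Returns a list of (chunk_text, chunk_start, chunk_end)."""
    chunks = []
    cur_words, cur_start, prev_end = [], None, None
    for text, start, end in words:
        if cur_start is None:
            cur_start = start
        if end - cur_start > max_len and cur_words:
            # close the current chunk before this word would overflow it
            chunks.append((" ".join(cur_words), cur_start, prev_end))
            cur_words, cur_start = [], start
        cur_words.append(text)
        prev_end = end
    if cur_words:
        chunks.append((" ".join(cur_words), cur_start, prev_end))
    return chunks
```

You'd then slice the audio at each chunk's (start, end) and pair it with the chunk text; words longer than `max_len` on their own would still need special handling.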
I have forced alignments, too.
E.g. for The True Story of Ah Q: https://github.com/Yorwba/LiteratureForEyesAndEars/tree/mast... — .align.json is my homegrown alignment format, .srt are standard subtitles, and .txt is the text. Note that in some places I have [[original text||what it is pronounced as]] annotations to make the forced alignment work better (e.g. the "." in LibriVox.org, pronounced as 點 "diǎn" in Mandarin). Oh, and cmn-Hans is the same thing transliterated into Simplified Chinese.
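If it helps anyone reading along, annotations in that [[original||pronounced-as]] style can be split into two parallel texts with a small regex — one for display, one to feed the aligner. This is my own sketch of the syntax as described above, not the repo's actual tooling, and the sample string is made up:

```python
import re

# Matches [[original||pronounced-as]] annotations (non-greedy, so
# multiple annotations on one line are handled independently).
ANNOT = re.compile(r"\[\[(.*?)\|\|(.*?)\]\]")

def display_text(s):
    """Keep the original written form (group 1)."""
    return ANNOT.sub(lambda m: m.group(1), s)

def aligner_text(s):
    """Keep the form as actually pronounced (group 2)."""
    return ANNOT.sub(lambda m: m.group(2), s)
```

So a string like `"LibriVox[[.||點]]org"` would display as `LibriVox.org` while the aligner sees the 點 that the narrator actually says.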
The corresponding LibriVox URL is predictably https://librivox.org/the-true-story-of-ah-q-by-xun-lu/
Thanks, I'll check it out. I don't know any Chinese so I'll probably reach out to you for some help :)
Sure, feel free to email me at the address included in every commit in the repo.
LibriVox seems like a great source, being public domain, though the quality is highly variable.
I can recommend Elizabeth Klett as a good narrator. I've sampled her recordings of the Jane Austen books Emma, Pride and Prejudice, and Sense and Sensibility.
Have you released your flashcard app?
If you're interested, I have a small side project (https://imaginanki.com) for generating Anki decks with images + speech (via SDXL/Azure).
Some language-learning resources, from "Show HN: Open-source tool for creating courses like Duolingo" (2023): https://news.ycombinator.com/item?id=38317345
Not OP, but I develop Mochi [0] which is a spaced repetition flash card app that has text-to-speech and a bunch of other stuff built in (transcription, dictionaries, etc.) that you might be interested in.
[0] https://mochi.cards
What spaced repetition algorithm does it use?
It's just an Anki deck.
Did you try XTTS v2 for Mandarin? I'm curious how it compares with EmotiVoice.
It has a big problem with hallucination in Chinese, random extra syllables all over the place.
Makes sense, I get hallucinations in English too.
Just listened to the demo voices for EmotiVoice and WhisperSpeech. I think WhisperSpeech edges out EmotiVoice. EmotiVoice sounds like it was trained on English spoken by non-native speakers.