Common Voice

FF's TTS is an important project for anyone who wants a trivial-to-use text-to-speech system. It's built into the browser, so you can just run

    {
      const wss = window.speechSynthesis;
      // Fetch the list once; it may be empty until the browser has loaded its voices.
      const voices = wss.getVoices();
      voices.forEach((voice, i) => {
        const str = `Voice ${i} is ${voice.name}`;
        const s = new SpeechSynthesisUtterance(str);
        s.voice = voice;
        wss.speak(s);
        console.log(str);
      });
    }

in the console to get various TTS examples. In some browsers this works offline, while others use a cloud-based TTS system.

Is there a handy demo website somewhere to access that?

I extracted the Narrator module from Firefox's reader mode. It's not so good in other browsers though. On macOS, I'm using the Alex voice.

https://tts.cns.wtf/

https://github.com/python273/tts-app

I've tried this on Linux+Firefox, but it doesn't sound very good yet, I'm afraid.

Yeah, it's using OS's voices and usually they are quite robotic. But it's still quite useful if you want to upload a long text into your brain.

I'm also using Narrator in my HN reader (no settings for voice right now though, except localStorage['cfg-narrator-rate'] via devtools):

https://hn.cns.wtf/#38532761

This also works in Chrome (My version is: 119.0.6045.199)

FF has 8611 voices, Chrome has 19.

That's odd, my Chrome (119.0.6045.199) has 176 voices. Not all are English though.

Maybe it's because I'm on Linux? (Pop!_OS 22.04 LTS)

Also, I have only 3 English voices.

Do you know if it’s been extracted into a standalone library? The state of the open source TTS seems to not be great. Presumably the data for a voice is harder to put together than training a speech to text system like whisper.

The voices don't come from the browsers themselves, but from the operating systems and their underlying TTS APIs: SAPI on Windows, Speech Dispatcher on Linux, and AVSpeechSynthesizer on Apple devices. If you install a third-party voice compatible with one of these, the browsers will pick it up.
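To check which of the voices a browser reports are local and which depend on a remote service, a minimal sketch could group them by the `localService` flag that the Web Speech API exposes on each voice. `summarizeVoices` is a name made up for this example; it only assumes each voice object has `lang` and `localService` properties.

```javascript
// Hypothetical helper: tally voices per language, split into local and remote.
// Assumes each voice has the `lang` and `localService` properties
// defined on SpeechSynthesisVoice.
function summarizeVoices(voices) {
  const summary = {};
  for (const voice of voices) {
    if (!summary[voice.lang]) {
      summary[voice.lang] = { local: 0, remote: 0 };
    }
    if (voice.localService) {
      summary[voice.lang].local += 1;
    } else {
      summary[voice.lang].remote += 1;
    }
  }
  return summary;
}

// In a browser console (the list may be empty until voices have loaded):
// console.table(summarizeVoices(window.speechSynthesis.getVoices()));
```

Taking the voice list as a parameter keeps the helper usable outside a browser, for example with a saved list of voice objects.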

This is a reasonably good open source local TTS that's fast enough to use in home automation: https://github.com/rhasspy/piper

On macOS, it's

    say "enter text here"

To pick a different voice:

    say -v Fred "enter text here"

To list voices:

    say -v "?"

(The quoting is necessary to prevent ZSH from interpreting the question mark as a glob.)

I hear Firefox's TTS is important, yet prior to your comment I didn't even know it existed. This sort of stuff should be more discoverable, and have a more accessible (ahem) API.

It's part of the Web APIs; it's not just Firefox. Chrome and Safari have supported it since 2013/2014.

It looks like speechSynthesis is supported in all the major browsers, not just FF. https://developer.mozilla.org/en-US/docs/Web/API/Window/spee...

This is handy to know, thanks. I was just trying out Common Voice a few days ago.

They have a good example of a community page for folks wanting to help with a particular language.

I was just thinking today that Firefox is worth switching back to because it was so fast, except I hadn't had a chance to do it.

If anyone else thinks it's important for there to be an independent browser dedicated to privacy and security (and independence), they could use as many casual browser switchers as they can get. I'm happy to be back on a few FF extensions that didn't work quite the same on any Chrome-based browser.

Nice! For debugging I've been having a lot of fun supplementing stderr (for especially important messages that I don't want to miss) with the free TTS voices available on Windows (by running a PowerShell command) and from Chrome (via WebSocket). It's nice to have even more voices to choose from.

Pretty cool, it printed a list of over 8000 on my machine (Ubuntu, well Kubuntu now) and then proceeded to speak the voices after printing them all.

I submitted a request for Norwegian Bokmål, and realised a complication which I'm sure must affect other languages too:

Norway has two separate official languages. They are unusually close - one is relatively close to Danish, and the other started as a collection of dialects, but technically they are written languages, especially Bokmål which basically means "book language".

I'm unusual in that I speak close to "pure" Bokmål. Thanks to expectations at school etc., a lot of speakers who write Bokmål will adjust or tone down their dialect if asked to read a text that is written in grammatically and orthographically correct Bokmål, but will otherwise speak in a manner that can deviate fairly significantly from the written language.

As such, depending on whether your goal is text to speech or speech recognition, the pronunciation you will need is very different.

E.g. people I know who write Bokmål might say something like "hva erredu ser på a?" ("what are you looking at?") with hardly any gaps between words, while I would stick close to the written "hva er det du ser på?" with clear gaps. In recognition you need to handle both (and many other variations), while for generation you'd at least by default usually want the latter unless there are indications the text is written in dialect.

It strikes me you'd really want people to write more detail about what it is they are speaking and/or let people tag/label data with additional info about accents. Not just for this, but for other multi-lingual speakers as well. E.g. it'd be helpful to have many foreign accents in the English (and other languages) dataset for recognition, but as much as I want speech recognition to understand me, I'm not particularly interested in teaching it to speak English with a strong Norwegian accent.

That is less of an issue than the dialects in some languages that can involve much more than just speaking the same words differently.

To take another example, "Jeg åpnet døren og gikk ut i solen" and "Jeg åpna døra og gikk ut i sola" are both valid Bokmål. Depending on context a reader may stick strictly to the text or swap åpnet<->åpna, døren<->døra, solen<->sola, and every permutation is valid. Which exact set you use differs and some speakers will write one but use the other when speaking. E.g. I would say åpna, døra, sola, but write åpnet, døren, solen. The latter is more formal and/or old-fashioned in some parts of the country, but the perception of that also varies by region. And this totally leaves out all the dialect variations used by people who'd say their language is Bokmål, and would be recognized as such by Norwegian speakers, but who use variants of words or conjugations that aren't technically recognized as valid Bokmål.

The former is more "modern" (several of the forms are only valid Bokmål as a result of successive language reforms), more common in the Eastern part of Norway outside of the posher parts of Oslo and other wealthy regions, and (weirdly) more common among 1970s radical left-wing academics (especially people involved with the Maoist Workers Communist Party/AKP-ML) as an affectation/sociolect, with each of these groups also deviating in other aspects....

If you want to maximize the utility of a dataset like this, you really would want to let each speaker at least assign a lot of tags/labels to their profile; even if you don't want to deal with the hornet nest of trying to figure out all the distinctions, even unstructured labels would be a start, and ideally allowing people to tag individual recordings as well, because there are a lot more variations than just "language" and "accent" here.

Norwegian Bokmål

... is currently in progress. What's missing is a sufficiently complete translation of the UI https://pontoon.mozilla.org/projects/common-voice/ and a sufficiently large number of sentences for people to record https://commonvoice.mozilla.org/nb-NO/write

let each speaker at least assign a lot of tags/labels to their profile

Common Voice data files have columns for age, gender, accents, variant, locale and segment. (Not sure what that last one is.) These are per recording, but I'm pretty sure they're the same for all recordings by the same speaker.
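As a rough illustration of how those columns could be used, the sketch below tallies recordings per value of one column in a tab-separated file. It assumes the first line is a header row containing the column names listed above (the `path` column in the sample is also an assumption); `countByColumn` and the sample rows are invented for this example and are not real Common Voice data.

```javascript
// Hypothetical helper: count rows per value of one column in TSV text.
// Assumes the first line is a tab-separated header row.
function countByColumn(tsvText, column) {
  const lines = tsvText.split(/\r?\n/).filter((line) => line.length > 0);
  const header = lines[0].split("\t");
  const index = header.indexOf(column);
  if (index === -1) {
    throw new Error(`No column named "${column}"`);
  }
  const counts = new Map();
  for (const line of lines.slice(1)) {
    // Empty cells are grouped under "(unspecified)".
    const value = line.split("\t")[index] || "(unspecified)";
    counts.set(value, (counts.get(value) ?? 0) + 1);
  }
  return counts;
}

// Made-up sample rows, not real Common Voice data.
const sample = [
  "path\tage\tgender\taccents\tvariant\tlocale\tsegment",
  "a.mp3\tthirties\t\tUrban East Norwegian\t\tnb-NO\t",
  "b.mp3\t\t\t\t\tnb-NO\t",
  "c.mp3\ttwenties\t\tUrban East Norwegian\t\tnb-NO\t",
].join("\n");

console.log(countByColumn(sample, "accents"));
```

A tally like this would also show how often a field is simply left empty, which matters if the labels are meant to separate dialects.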

Weird to hold off on adding a language because the UI isn't translated. Why would there be an assumption that the language people want to record is linked to preferred UI language?

I don't want Norwegian UI - I just want to be able to record Norwegian sentences. If the UI switches to Norwegian I'd be very annoyed, as I haven't indicated I want that and my browser settings specify English.

(I avoid Norwegian for UIs, because the translations are generally wildly inconsistent in how they translate key terms that I'm used to seeing in English, so it's a massive nuisance - when people assume UI and content language should be the same, that is a major failing to me)

Re: tags someone else pointed out the accent field is being used for this, even though the UI describes that as specifically for accents.

[comment removed]

Frankly this does seem like a massive barrier to me.

It's certainly causing me to lose interest, and I suspect it's driving away a lot of people, not least because it was not at all obvious to me there was some way of speeding up getting a language in the first place.

It was already off-putting not to be given a way to write sentences or record right away.

But now that I know, I have no interest in wasting time contributing to a UI translation I actively don't want to be subjected to, but would happily contribute recordings and sentences on occasion if the language was enabled because the potential for speech recognition and tts utility is entirely separate in value from UI.

This whole approach feels really backwards to me, and the really short list of languages no longer surprises me.

EDIT: I see I actually have had it bookmarked a long time, and presumably lost interest once before due to the lack of my language.

EDIT2: As much as the Norwegian UI is already annoying me and I've already spotted at least one spelling mistake in it, and one translation that is "correct" that thoroughly annoys me, I'll see if I can submit some sentences at least.

it was not at all obvious to me there was some way of speeding up getting a language in the first place.

Yeah, that's the biggest failing of Common Voice in my opinion. Getting a new language up to speed could be much improved by simply adding a few links to documentation, but even the existing links are broken, which I reported in March 2022... https://github.com/common-voice/common-voice/issues/3637

I have no interest in wasting time contributing to a UI translation I actively don't want to be subjected to

Translating the UI may still help you get other people to record, even if you don't want to use it yourself.

I'll see if I can submit some sentences at least

If you want to go faster, there's also a project to extract sentences from Wikipedia etc. in small doses Mozilla's lawyers and Wikimedia's lawyers have agreed are fair use. I think you'd only need to define how Norwegian Bokmål separates sentences. (E.g. after a period but not if it's a common abbreviation like "etc." in the preceding sentence.) https://github.com/Common-Voice/cv-sentence-extractor
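The splitting rule described above could look roughly like the sketch below. This is not how cv-sentence-extractor is implemented or configured; `splitSentences` and the short abbreviation list are made up for illustration, and a real Norwegian list would need input from native speakers.

```javascript
// Hypothetical abbreviation list; a real one would be much longer.
const ABBREVIATIONS = new Set(["etc.", "bl.a.", "f.eks.", "osv."]);

// Split after a token ending in ".", "!" or "?", unless that token
// is a known abbreviation.
function splitSentences(text, abbreviations = ABBREVIATIONS) {
  const sentences = [];
  let current = [];
  const tokens = text.trim().split(/\s+/).filter((token) => token.length > 0);
  for (const token of tokens) {
    current.push(token);
    const endsSentence =
      /[.!?]$/.test(token) && !abbreviations.has(token.toLowerCase());
    if (endsSentence) {
      sentences.push(current.join(" "));
      current = [];
    }
  }
  // Keep any trailing text that lacks closing punctuation.
  if (current.length > 0) {
    sentences.push(current.join(" "));
  }
  return sentences;
}

console.log(
  splitSentences(
    "Jeg åpnet døren og gikk ut i solen. Vi så fugler, trær osv. i parken. Hva ser du på?"
  )
);
```

One known weakness of this sketch: an abbreviation that happens to end a sentence will not trigger a split, so the two sentences around it are merged.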

Target segment was a way of including specific subdatasets. For example the digits dataset which was just the digits 0-9 and yes/no.

Are you autistic? I ask because this is HN where lots of people are, and choosing to speak the literary norm in countries with diglossia is often associated with autism. For example, foreigners in Finland are urged to quickly get to grips with puhekieli (spoken Finnish) because speaking kirjakieli (the literary norm) in everyday contexts, or writing it in chats, is “something only autistic people do”.

Not to my knowledge, though I may have some traits.

That said, in Norway the literary form is/was spoken on e.g. TV and radio similar to how RP (received pronunciation) is/was spoken on the BBC, more so (in both cases) before than now where dialects are more broadly tolerated. On top of that, in affluent areas of Western Oslo and adjoining affluent areas the dialect sits mostly within what is "allowed" in Bokmål, and actually mostly towards a more conservative end of the allowed range than where I sit, and it's somewhat political, in that more conservative forms of Bokmål historically tended to be associated with social status (or aspirations...).

It's unusual more in that the pockets and social groups where dialects overlap fully or almost entirely with Bokmål are fairly small.

My spoken dialect is within that spectrum, exacerbated by reading a lot of older literature at an early age that used quite old-fashioned forms of Bokmål, and picking up more formal language than many of my peers spoke through that, but I tend to be closer to the more affluent dialect in writing than spoken.

(EDIT: My spoken dialect would probably fit as a somewhat "posh" version of Urban East Norwegian[1] today, with somewhat more conservative word choices in places where contemporary Urban East Norwegian would have deviated from Bokmål in minor ways in the 70's and 80's by being somewhat more "relaxed" in ways that have since been accepted in subsequent adjustments of the rules)

If you heard me alongside my dad there'd be relatively minor differences between our dialects, and I'd probably sound marginally less formal as I adopted some spoken patterns from the more working class area I grew up in outside Oslo, while he at least when younger would be recognisable as having grown up on the Western edges of Oslo.

Beyond that, language has always fascinated me, and I tended to take a certain level of delight in torturing my Norwegian teacher who favoured the other official language - Nynorsk. Nynorsk and Bokmål overlap very significantly, and more so after recent language reforms which have tended towards allowing more Nynorsk forms of words, or ones closer to them, in Bokmål. Our Norwegian teacher very much wanted us to use those forms (that'd be favouring "sola" over "solen" etc.), and I used to express my distaste for Nynorsk by instead exaggerating my preference for the more conservative Bokmål forms.

[1] https://en.wikipedia.org/wiki/Urban_East_Norwegian

Riksmål is the word you are looking for.

When I was growing up Riksmål was far more conservative than what I spoke despite the fact that I spoke fairly conservative Bokmål, and it was still somewhat more conservative than how I wrote. I've not paid much attention to Riksmål, but I'm vaguely aware they've moderated themselves quite a bit.

However, a quick check with Det Norske Akademi's dictionary shows that both my spoken and written Norwegian are still not a full match for Riksmål, though I see they've pretty much "surrendered" and even accepted some -a endings, so it's getting close-ish.

Maybe in another couple of decades.

This is a great argument.

I particularly agree with your point regarding English - my German accent probably sounds jarring to most native English speakers, but it should still be understood. To add to your argument, I have sometimes tried to turn on subtitles for YouTube videos in some accent of English that I haven't had much contact with (such as Nigerian English), but the auto-generated closed captions turned out to be even more useless than my own comprehension.

However, one should keep in mind that Mozilla's main goal here is accessibility, with the implication that they mean accessibility for blind and deaf people in particular - as opposed to accessibility for stunted multilinguals like us. For these purposes, being able to transcribe mainly mainstream uses of the language is fine, and so is being able to generate speech in a hodge-podge averaged dialect. I highly doubt most blind people care about whether their TTS engine speaks The Queen's English or not, as long as it is clear and understandable.

What is "clear and understandable" varies greatly, though. E.g. Nigerian English is often subtitled in the UK, but fairly often so is Scottish English... Both often to the great dismay of speakers of the two who sometimes are very annoyed at the expectation that people might not understand them.

Nigerian English is actually fascinating in that there's a whole spectrum from Nigerian Pidgin, which ranges from nearly unintelligible to English speakers, to "mostly British English" in terms of orthography and grammar, but which still tends to incorporate words from several different Nigerian languages and pidgin. (e.g. abeg, don't give me any wahala; Please, don't give me any trouble)

Now consider that Nigeria is about to become the country with the second largest number of English speakers worldwide (it's close to tied with India, depending on which sources and level of proficiency you consider, and Nigeria's population is growing far faster than India's). While it's still quite far behind the UK for people speaking it as their first language, current population growth and increasing use of English (e.g. my ex-wife's first language is English because her parents' first languages were Igbo and Yoruba, and that kind of situation is driving adoption) are likely to make Nigeria the second largest on that measure as well.

So handling a broader range of dialects will matter, at least in terms of recognition - I do agree that there's more flexibility for generation, though even there if you try to feed a broader Nigerian English pidgin to a TTS engine and it doesn't know what to do with the words it might well end up being unintelligible both to e.g. American or British English speakers and Nigerian English speakers.

If you want to maximize the utility of a dataset like this, you really would want to let each speaker at least assign a lot of tags/labels to their profile; even if you don't want to deal with the hornet nest of trying to figure out all the distinctions, even unstructured labels would be a start, and ideally allowing people to tag individual recordings as well, because there are a lot more variations than just "language" and "accent" here.

This is exactly what the freeform accent (actually "variant") field is. You can add as many tags as you like. https://foundation.mozilla.org/en/blog/how-we-are-making-com...

Then the guidance on the site really needs to be updated, as that's not what the help in the profile section says, and starting to type the auto-completing options didn't really give reason to suspect that either.

Didn't Mozilla also have a related speech-to-text project that got canned/moved to a different company? Or was that different?

DeepSpeech? https://github.com/mozilla/DeepSpeech

Mozilla didn't want to fund further development; most of the team ended up at Coqui.ai

Mozilla shut that project down the same day (Apr 12, 2021) as: "Mozilla is partnering with NVIDIA, which is investing $1.5 million in Mozilla Common Voice,". Aka they got paid off by Nvidia to not compete.

DeepSpeech is not competition for NVIDIA, quite the contrary. More people using DeepSpeech means more GPUs sold.

Seems more likely that Mozilla would have shut down both projects, but NVIDIA funding saved the more important one.

Does speech to text require that much compute?

EDIT: Nvm, there seems to be a new project called Sayboard that does everything on your phone:

https://github.com/ElishaAz/Sayboard

(though switching from Swiftkey is a bit annoying)

Thanks, that's the one I was thinking about. I remembered it had an odd name.

This is an open dataset of voice samples to train models, so not really STT/TTS software.

I wish they'd concentrate on the browser.

Voice integration in a browser for control and feedback would be great if you were blind.

Better in the DE than an app, even the browser, unless it's like ChromeOS and the browser is the DE.

And text-to-speech. Which is already a standard: https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...

The web is in a hilarious state where it's harder to style an option in a drop down than it is to generate speech from some text

Mycroft users really wished that Mozilla had kept up efforts in this direction, because otherwise the only option for reliable speech-to-text is uploading every command you give your agent to Google or Baidu. The browser is important, and I don’t support Mozilla’s vacuous projects for social-justice cred, but there are a handful of areas where we need some non-profit to provide a privacy-respecting solution.

That is indeed important, so I take it back (can't edit original post now).

Accessibility is an important part of the browser :)

What work do you want them to do that they're not already doing?

I’m sad that this is English only. I’d love to contribute lots of voice recordings for a Dutch TTS from a nonprofit org like Mozilla.

They do collect other languages - there’s a setting for it in the annotation section, and the dataset downloads let you choose other languages.

e.g.: https://commonvoice.mozilla.org/nl/listen

Woops! Thanks :-)

Don’t feel bad - it’s not especially obvious. I only thought about it because I’m already familiar with the project.

Although English is the most-contributed language, one of the goals of Common Voice is to support languages that wouldn’t normally receive attention from commercial providers.

The most-contributed language is Catalan with 3678 hours recorded vs. 3395 hours in English https://commonvoice.mozilla.org/en/languages (The language list sorts your browser's UI languages ahead of all others, which is why English may appear on top for you.)

https://commonvoice.mozilla.org/en/about?tab=how-add-languag...

Voice datasets also underrepresent: non-English speakers, people of colour, disabled people, women and LGBTQIA+ people.

How does being gay change your voice?

https://en.wikipedia.org/wiki/LGBT_linguistics#Accents_of_En...

I'm aware of the trope. I've yet to meet anyone that adheres to it, though. Always thought it was just one of those things that Hollywood overemphasizes to "other" gay people.

Always thought it was just one of those things that Hollywood overemphasizes to "other" gay people.

They may over emphasize it. I don't know. But now you know they didn't invent it.

GP linked you to a summary of a scientific finding, not a trope. You can read more about the study here [0].

Regardless, I think the point of collecting these stats is to make sure voices of people that might normally be under-represented in a uniform sample of a relatively small size can be corrected for. These are common stats to collect in surveys for similar reasons.

[0] https://www.sciencedirect.com/science/article/abs/pii/S00954...

There was someone in my high school who had the stereotypical voice. One of his friends, who'd known him since they were little, mentioned that he had talked like that his whole life.

With recent events in AI and deepfake technology, I would need to see some assurances before I agreed to “donate my voice” to something like this. It seems like the project is for voice recognition, not generation, but it’s not immediately clear.

I don't know if assurances is the right term, but everything around machine learning and generation seems to be quite liberal with respecting people's property, so indeed something called "donate your voice" made me pause.

Mozilla is probably the right organization for that. Their main product however is dwindling, and I'm not sure what will happen to their data if they ceased to exist. There is a tendency for dying organizations to be pulled apart for scraps, and this would definitely become an IP of interest for a lot of companies with much less noble causes.

The recordings are available for download, so if a company wants to use them for less noble causes, they can already do that.

What assurances would you like to see?

How many people here have a different "reading voice" vs their normal conversational voice? Can conversational models be trained even if much of the training data sounds "scripted"?

I remember when they (Mozilla’s CV team) solicited feedback before they got started, I brought up that issue and proposed a different approach to gathering conversational speech data, but it wasn’t picked up. The belief that it’s better to have more but crappy data rather than less data matched to what you actually want to solve is quite pervasive.

Crowdsourced datasets like this and the ones produced by the OpenAssistant project could easily become the ONLY way to build foundational models if the courts decide that what OpenAI and co are doing is not fair use. I don't think I would call this scenario unlikely, either.

Amazing.

One of my hopes with OpenAI was that they were going to be truly open.

Open datasets, open code, open models, open evaluation.

But it is now a Microsoft puppet running on corporate profit goals.

This and HuggingFace are great to see. I hope HuggingFace isn’t acquired by Microsoft like GitHub was.

Why then is the text2speech in reader mode (which other than that is excellent) on a Linux Firefox so extremely bad? Much worse than Stephen Hawking's text2speech.

While this dataset is orders of magnitude smaller than what recent speech models like Whisper and Seamless got trained on, and while it is meant for supervised as opposed to self-supervised learning where data is more abundant, it can still be useful for finetuning an existing model for improving its score on a specific language.

But, why?

Related. Others?

Mozilla Common Voice Adds 16 New Languages and 4,600 New Hours of Speech - https://news.ycombinator.com/item?id=28073016 - Aug 2021 (170 comments)

Firefox Voice - https://news.ycombinator.com/item?id=24096082 - Aug 2020 (154 comments)

Firefox Voice: Browse the web with your voice - https://news.ycombinator.com/item?id=23902560 - July 2020 (2 comments)

Mozilla Common Voice Dataset: More data, more languages - https://news.ycombinator.com/item?id=23695377 - June 2020 (41 comments)

The Common Voice Project by Mozilla reached its first goal: 1k hours in englisch - https://news.ycombinator.com/item?id=23051756 - May 2020 (1 comment)

Common Voice: A Massively-Multilingual Speech Corpus - https://news.ycombinator.com/item?id=21887693 - Dec 2019 (9 comments)

Common Voice – Mozilla's initiative to help teach machines how real people speak - https://news.ycombinator.com/item?id=21268579 - Oct 2019 (49 comments)

Mozilla releases the largest to-date public domain transcribed voice dataset - https://news.ycombinator.com/item?id=19270646 - Feb 2019 (61 comments)

Mozilla Overhauls Speech-To-Text Contribution Interface - https://news.ycombinator.com/item?id=17436958 - July 2018 (42 comments)

Initial Release of Mozilla’s Open Source Speech Recognition Model and Voice Data - https://news.ycombinator.com/item?id=15808124 - Nov 2017 (88 comments)

Project Common Voice - https://news.ycombinator.com/item?id=14794654 - July 2017 (57 comments)

Mozilla: Project Common Voice - https://news.ycombinator.com/item?id=14786881 - July 2017 (1 comment)

I'd like to give a shout-out to Common Voice Android: https://github.com/Sav22999/common-voice-android

It's a handy app for those interested in contributing to the project. You can record voices for the languages you speak and validate other user contributions. I used to be a frequent contributor about two years ago, and this app had a much more user-friendly design compared to the official website version.

Additionally, check out the official Common Voice Matrix channel: https://chat.mozilla.org/#/room/#common-voice:mozilla.org