I wonder: did they address common ML ethics questions? Specifically: Are the ML algorithms better/worse on male than on female speech? How about different languages or dialects? Are they specifically tuned for speech at all, or do they also work well for music or birdsong?
That said, the examples are impressive and I can't wait for this level of understandability to become standard in my calls.
Why is the ethics question important? It's a new feature for an audio codec, not new material to teach in your kids' curriculum.
This is a great question! Here's a related failure case that I think illustrates the issue.
In my country, public restroom facilities replaced all the buttons and levers on faucets, towel dispensers, etc. with sensors that detect your hand under the faucet. Black people tell me they aren't able to easily use these restrooms. I was surprised when I heard this, but if you google this, it's apparently a thing.
Why does this happen? After all, the companies that made these products aren't obviously biased against black people (outwardly, anyway). So this sort of mistake must be easy to fall into, even for smart teams in good companies.
The answer ultimately boils down to ignorance. When we make hand-detection sensors for faucets, we typically calibrate them with white people in mind. Of course, different skin tones have different albedo and different reflectance properties, so the sensors are less likely to fire for darker skin. Some black folks have a workaround where they hold a (white) napkin in their hand to get the faucet to work.
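To make the failure mode concrete, here's a toy sketch. It has nothing to do with any real sensor, and all the numbers are invented; it just shows how a threshold calibrated only on light-skin reflectance readings can end up sitting above the signal that darker skin actually returns:

```python
# Toy sketch, not any real sensor: a detection threshold calibrated only on
# light-skin reflectance samples. All numbers are invented.
LIGHT_SKIN_SIGNAL = [0.80, 0.85, 0.90]  # reflected IR intensity, arbitrary units
DARK_SKIN_SIGNAL = [0.30, 0.35, 0.40]   # lower albedo -> weaker return signal

# "Safe" margin of half the weakest calibration reading... but the calibration
# data contained only light-skin samples.
THRESHOLD = 0.5 * min(LIGHT_SKIN_SIGNAL)  # = 0.40

def hand_detected(signal: float) -> bool:
    return signal > THRESHOLD

print([hand_detected(s) for s in LIGHT_SKIN_SIGNAL])  # [True, True, True]
print([hand_detected(s) for s in DARK_SKIN_SIGNAL])   # [False, False, False]
```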
How do we prevent this particular case from happening in the products we build? One approach is to ensure that the development teams for skin sensors include a wide variety of skin types. If the product development team had a black guy, for example, he could say "hey, this doesn't work with my skin, we need to tune the threshold." Another approach is to ensure that different skin types are reflected in the data used to fit the statistical skin models we use. Today's push for "ethics in ML" grew out of this second path, as a direct desire to avoid these sorts of problems.
I like this handwashing example because it's immediately apparent to everyone. You don't have to "prioritize DEI programs" to understand the importance of making sure your skin detector works for all skin types. But, teams that already prioritize accessibility, user diversity, etc. are less likely to fall into these traps when conducting their ordinary business.
For this audio codec, I could imagine that voices outside the "standard English dialect" (e.g. thick accents, different voices) might take more bytes to encode the same signal. That would raise bandwidth requirements, worsen latency, and increase data costs for those users. If the codec is designed for a standard American audience, that's less of an issue, but codecs work best when they fit reasonably well with all kinds of human physiology.
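To make that checkable rather than hypothetical, here's a minimal sketch of what a per-dialect breakdown could look like. The encode() and quality() functions are placeholders for whatever codec and perceptual metric you actually use; nothing here is from the paper.

```python
# Hypothetical sketch: break codec evaluation out per dialect instead of
# reporting one aggregate number. encode() and quality() are placeholders.
import statistics

def per_dialect_stats(clips, encode, quality):
    """clips: iterable of (pcm_samples, dialect_label)."""
    by_dialect = {}
    for pcm, dialect in clips:
        bitstream = encode(pcm)
        by_dialect.setdefault(dialect, []).append(
            (len(bitstream), quality(pcm, bitstream))
        )
    return {
        dialect: {
            "mean_bytes": statistics.mean(size for size, _ in results),
            "mean_quality": statistics.mean(score for _, score in results),
        }
        for dialect, results in by_dialect.items()
    }
```

A gap in either column (more bytes for the same quality, or lower quality at the same bitrate) is exactly the kind of thing an aggregate score would hide.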
What if it is a Pareto improvement: bigger gains for some dialects, but no worse than the earlier version for anyone? Should it be shelved, or tuned down so that every dialect sees exactly the same percentage gain?
Here's a question that should have the same/similar answer: increasingly, job interviews are being handled over the internet. All other things being equal, people are likely to respond more positively to candidates with a more pleasant voice. So if new ML-enhanced codecs become more common, we may find that some group X has just a slightly worse quality score than others. Over enough samples, that would translate to a lower interview success rate for them.
Do you think we should keep using that codec, because overall we get a better sound quality across all groups? Do you feel the same as a member of group X?
I don't think it's a given that we shouldn't keep using that codec. For example, maybe the improvement is due to an open source hacker working in their spare time to make the world a better place. Do we tell them their contribution isn't welcome until it meets the community's benchmark for equity?
The same argument can also be used to degrade the performance for all other groups, so that group X isn't unfairly disadvantaged. Or it can even be used to argue that the performance for other groups should be degraded to be even worse than group X's, to compensate for other factors that disadvantage group X.
This is a reductio ad absurdum, but it goes to show that the issue isn't as black and white as you seem to think it is.
A person creating a codec doesn't choose whether it's globally adopted. System implementors (like, for example, Slack) do. You don't have to tell the open source dev anything. You don't owe it to them to include their implementation.
And if their contribution was to the final system, sure, it's the owner's choice what the threshold for an acceptable contribution is, in the same way they can set any other benchmark.
The context here was Pareto improvement. You're bringing up a different situation.
The grandparent provided an argument why we might not want to use an algorithm, even if it provided a Pareto improvement.
I suggested that the same argument could be used to say that we should actively degrade performance of the algorithm, in the name of equity. This is absurd, and illustrates that the GP argument is maybe not as strong as it appears.
One thing the small mom-and-pop hacker types can do is disclose where bias can enter the system or evaluate it on standard benchmarks so folks can get an idea where it works and where it fails. That was the intent behind the top-level comment asking about bias, I think.
If improving the codec is a matter of training on dataset A vs dataset B, that’s an easier change.
I would be very surprised if biasing the codec towards particular dialects or other distinctive subsets of the data yielded no improvement for them. And we could certainly be fine with some kinds of bias. Speech codecs are intended to transmit human speech, after all, not that of dogs, bats, or hypothetical extraterrestrials. On the other hand, a wider dataset might reduce overfitting and force the model to learn better.
If the codec is intended to work best for human voice in general, then it is simply not possible to define sensible subsets of the user base to optimize for. Curating an appropriate training set therefore has a technical impact on the performance of the codec. Realistically, I admit that the percentages of speech samples per language in such a dataset would be weighted according to the relative number of speakers. This is of course a very fuzzy number with many sources of systematic error (like what counts as one language, whether non-native speakers count, which level of proficiency is considered relevant, etc.), and ultimately English is a bit more important since it is the de facto international lingua franca of this era.
In short, a good training set is important unless one opines that certain subsets of humanity will never ever use the codec, which is equivalent to being blind to the reality that more and more parts of the world are getting access to the internet.
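As a rough sketch of what such weighting could look like, with approximate speaker counts and an arbitrary floor of my own choosing so small languages aren't dropped entirely:

```python
# Rough sketch: allocate training clips per language roughly in proportion to
# speaker counts, with a floor so smaller languages aren't dropped entirely.
# Speaker counts are approximate; the floor and total are arbitrary choices.
speakers_millions = {"English": 1500, "Mandarin": 1100, "Hindi": 600, "Welsh": 0.9}

TOTAL_CLIPS = 100_000
FLOOR = 500  # minimum clips per language, regardless of population

total_speakers = sum(speakers_millions.values())
allocation = {
    lang: max(FLOOR, round(TOTAL_CLIPS * count / total_speakers))
    for lang, count in speakers_millions.items()
}
# Welsh gets bumped up to the floor; the big languages split the rest roughly
# in proportion to their speaker counts.
print(allocation)
```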
I get your point, but in the example used, and I can think of a couple of others that start with "X replaced all controls with touch/voice/ML", the much bigger ethical question is why they did it in the first place. The new solution may or may not be biased differently than the old one, but it's usually inferior to the previous ones or to simpler alternatives.
Imagine you release a codec which is optimized for cis white male voices, and every other kind of voice has perceptibly lower fidelity (at low bitrates). That would not go well...
Yeah, imagine a low bitrate situation where only English speaking men can still communicate. That would create quite a power imbalance.
Meanwhile G.711 makes all dudes sound like disgruntled middle aged union workers
No offense taken, but Codec2 seems to be affected a bit by this problem.
I get your point, but the questioner wasn't being rude or angry, only curious. I think it's a valid question, too. While it isn't as important to be neutral in this instance as in, say, a crime-prediction model or a hiring model, it should be boilerplate to consider ML inputs for identity neutrality.
This is actually a very technical question, since it means the audio codec might simply not work as well in practice as it could and should.
Because this gets deployed in the real world, affecting real people. Ethics don't exist only in kids' curricula.
This is an important question. However, I'd like to point out that similar biases can easily exist for non-ML, hand-tuned algorithms. Even in the latter case, test sets, and often even "training" and "validation" sets, are used for finding good parameters. Any of these can be a source of bias, as can the ears of the evaluators making these decisions.
It's true that bias questions often come up in an ML context because, fundamentally, these algorithms do not work without data, but _all_ algorithms are designed by people, and _many_ can involve data in setting their parameters; both can be sources of bias. ML is better known for it, I believe, because its _inductive_ biases are weaker than in traditional algorithms, and it is therefore more prone to adopting biases present in the dataset.
Usually regular algorithms aren't generating data that pretends to be raw data. That's the significant difference here.
Can you precisely define what you mean by "generating" and "pretends", in such a way that this neural network does both these things, but a conventional modern audio codec doesn't?
"Pretends" is a problematic choice of words, because it anthropomorphizes the algorithm. It would be more accurate and less misleading to replace "pretends to be" with "approximates". But then it wouldn't serve your goal of (seeming to) establish a categorical difference between this approach and "regular algorithms", because that's what a regular algorithm does too.
I apologize, because the above might sound rude. It's not intended to be.
I was avoiding the word "approximate", because that implies a connection to the original raw data.
A generative model guesses what data should be filled in, based on what is present in its own model. This process is totally ignorant of the original (missing) data.
To contrast, a lossy codec works directly with the original data. It chooses what to throw out based on what the algorithm itself can best reproduce during playback. This is why you should never transcode from one lossy codec to another: the holes will no longer line up with the algorithm's hole-filling expectations.
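Here's a toy numerical illustration of that last point, with plain uniform quantization standing in for a real lossy codec (so purely a sketch, not how any actual codec works): re-encoding an already-coded signal on a different grid adds a second layer of error on top of the first.

```python
# Toy sketch: uniform quantization stands in for a lossy codec. Going through
# "codec A" and then "codec B" leaves more error than using B directly,
# because B's grid doesn't line up with the holes A already punched.
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=10_000)

def quantize(x, step):
    return np.round(x / step) * step

a_only = quantize(signal, step=0.20)     # encode with "codec A"
a_then_b = quantize(a_only, step=0.15)   # transcode A's output with "codec B"
b_only = quantize(signal, step=0.15)     # encode the original with B directly

print("MSE, A only    :", np.mean((signal - a_only) ** 2))
print("MSE, A then B  :", np.mean((signal - a_then_b) ** 2))
print("MSE, B directly:", np.mean((signal - b_only) ** 2))
# A-then-B is worse than B directly, even though B alone is the finer quantizer.
```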
Not really. Any lossy codec is generating data that pretends to be close to the raw data.
Yes, but the holes were intentionally constructed such that the end result is predictable.
There is a difference between pretending to be the original raw data, and pretending to be whatever data will most likely fit.
As a notable example, the MP3 format was hand-tuned to vocals based on "Tom's Diner" (i.e. a female voice). It has been accused of being biased towards female vocals as a result.
As a person with a different language/accent, I have to deal with this on a regular basis: assistants like Siri not understanding what I want to say, even though native speakers don't have such a problem. Or, before the advent of UTF-8, websites and apps ignoring the special characters used in my language.
I wouldn't consider this a matter of ethics so much as one of technology limitations or ignorance.
Quoting from our paper, training was done using "205 hours of 16-kHz speech from a combination of TTS datasets including more than 900 speakers in 34 languages and dialects". Mostly tested with English, but part of the idea of releasing early (none of that is standardized) is for people to try it out and report any issues.
There are about equal numbers of male and female speakers, though codecs always have slight perceptual quality biases (in either direction) that depend on pitch. Oh, and everything here is speech only.