The Seamless Communication models

ChuckMcM
60 replies
22h16m

I look forward to the day when I'm wearing my headphones in a foreign land and hearing all of the discussions in my own language.

The "universal translator" which was part of Star Trek and a lot of other Sci-Fi I was exposed to as a kid was something I was really fascinated with. My Dad worked as a simultaneous French->English translator and sadly spent long hours away from home and, as a kid, I started trying to build a translator so that it could do his work and he could be home more.

Translation is important work that could help a lot of people. It's my hope that we get to the point where these models run entirely on locally carried resources.

rangestransform
14 replies
21h45m

how am i supposed to talk shit with my friends about other people in public then

flanbiscuit
8 replies
21h28m

I'm curious to know how well these models can pick up slang. Maybe if you talk shit in as thick a slang as you can it won't be able to give a good enough translation.

kredd
4 replies
21h11m

With my bi/trilingual friends who speak the same languages, we intermix them to make our point more clear. Don’t think models will be good enough for mixes for a few more years, so we’re safe!

smcin
3 replies
21h5m

Can you show us an example of such a sentence?

kredd
2 replies
19h55m

Hm, think of things like “On va bruncher” (we’re going to brunch). The word “brunch” doesn’t exist in French, but we add suffixes to make it fit into the sentence. Very common in Montreal. My French isn’t good enough to do that on the fly, but my francophone friends do it all the time.

In my other languages, the ones I am actually fluent in, it’s kinda the same — you use specific suffixes to soften or embolden your point and so on. Maybe throw in exclamation sounds from a specific language too. Eventually your nouns and verbs end up in different languages, with different suffixes where it “makes sense”, yet the person you’re talking to will “get it”.

Would be curious to try the new Seamless model on speech like that.

dopidopHN
0 replies
17h39m

I would think this model would fail with heavy Québécois lingo, as opposed to standard French.

bertil
0 replies
19h40m

This is extremely common for every new technology: “upload,” “download,” “stream,” “google,” “FaceTime,” most code patterns, all the new ML apps, “venmo” or whatever the name of the app you use for payment, etc. All of those are taken as is, get a verb ending slapped on, and it’s good enough. That’s true in German, Danish, Dutch, French, Italian, and Spanish.

The only thing that doesn’t work is if you talk to people too young to remember Skype. Then you feel old.

fasquoika
0 replies
18h31m

Reinventing Polari is certainly one way to make yourself less understood...

dopidopHN
0 replies
17h41m

Cockney English and French Verlan come to mind.

I don’t know about Cockney, but Verlan is very much alive.

dontupvoteme
0 replies
21h8m

I'd love to see a map of how it matches up to regional English/British accents and their slang.

ugh123
1 replies
21h22m

learn Klingon?

bertil
0 replies
19h40m

Klingon is definitely going to be in the top 50 languages covered…

csa
1 replies
20h59m

Speak in metaphor and/or code.

I’ve been in mixed language communities in which I wasn’t sure who spoke what, and I have found this to be quite effective when done right.

Good time to reference the ST:TNG “Darmok” episode and quotes like “Darmok and Jalad at Tanagra”.

ChuckMcM
0 replies
17h58m

Cincinnati when the Turkeys fell.

buryat
0 replies
18h49m

sacvnsune
12 replies
22h12m

If I'm not wrong, Google Pixel Buds offer a live translate feature.

echelon
11 replies
21h58m

Not in the voice of the original speaker.

stevenicr
9 replies
21h29m

now if I could just get the pixel buds tech to remove the voice of the original speaker and translate some youtube videos from thick accent english into no accent am-english.

keerthiko
7 replies
20h16m

Obligatory, not directed at you in particular since I'm sure you mean no offense, but just voicing a pet peeve:

I grew up bilingual outside the US, and speak English with a hybrid British/Indian/Middle Eastern accent (with some of my personal quirks, and mixing increasing amounts of various American accents over time). I can understand English in nearly any accent (Singaporean, Chinese, Vietnamese, Indian, Nigerian, eastern European) as long as the words involved are globally used and the grammar is passably queen's. Especially after hearing it for about an hour. And people who natively speak English with these various accents usually can understand my English better than they can an average American accent. Yet in this country, my accent is belittled, despite being perfectly understood and more versatile. Even by others who don't speak with the American accent!

This is the problem of the "default accent" anywhere being referred to as "no accent", so that anything deviating is considered "having an accent". This makes "accent" a negative trait, scaling from 0-bad to heavy-bad. But if the vernacular were such that we said "American accent" instead of "no accent", then no one's accent is bad, just unfamiliar.

Most of my non-American peers who were raised on English have a better command of the language than my American ones, yet they are mocked for their accents as if they don't know the language, when in reality it's the Americans' lack of familiarity with the language (as it's used globally) that prevents them from comprehending it.

So yes, put in more work, the world is shrinking and English is the global language (for better or worse). What you're saying is spoken from a position of privilege because the culture allows you to mock others' accents and imply your version of it is the correct one that everyone else should put in work to provide you with, rather than the other way around.

Every time you hear English with an accent other than British, American or Australian, remember that it usually means the speaker knows at least one entire other language as well, probably one that you would sound like an idiot if you tried to speak it. Don't be rude or dismissive of their command of English.

In fact, you were so close — you called it a "no accent am-english", when you could have just called it what it is — "an american accent".

stevenicr
2 replies
18h46m

I appreciate your sharing, and stating that you assume I meant no offense, and that your thoughts are not directed at me specifically.

I could have been more specific, but my request was for the tech to vary; I think it would lead to specific options for different people.

And actually to be even more.. not sure the word.. I want 'the Chicago accent' I think it's called, or Midwest / no accent. Personally, as much as I enjoy some entertainment from Jersey / NY accents, I would not volunteer to watch tutorials on tech taught by the Sopranos cast - as funny as that might be (and I get that if you are from the NE, you may be learning just fine being taught with such a language style).

As annoying as some of the Cali style of language is, I can understand the words and meanings without squinting my ears and spending double the brain cycles trying to understand the words, then interpreting the meaning, and then trying to put together concepts for understanding new ways of coding or using tech.

I've run into folks in Louisiana that I could not understand at all and had to ask for an interpreter at a gas station. From Florida to Chicago to Seattle down to Miss and Ala - I can hear what people are saying and learn without spending lots of extra energy trying to understand.

With that being said, I understand there are parts around Miami where accents may be thicker (or not) - and with some folks, even if they are using the right words and grammar, I may need to slow down the speech to actually learn if they were teaching a class.

The slow down and speed up options already exist with youtube.

"So yes, put in more work"

- I do try a bit. I don't mind accents with some folks and media. For example, I can listen to and enjoy Shankar sharing via the 'hidden brain' series, partially because his accent is limited but also because the media requires less thought intensity.

I have tried many youtubes, and bought a few courses taught from folks in India and other places where I just could not muster the energy. I literally squint with my ears and feel like my head gets hot trying to decipher what is being said, translate into what is meant, and how it should create new patterns of understanding in my brain.

I can only do that for so long and I am done. Now I just skip any learning video that has non-am English speakers. When I consider courses to sign up for or buy, I have to research the authors / speakers and find video of them to hear the audio, because I just can't learn well that way.

"other than British," - True story, a few years ago I had to call an ISP in Britain(?) and the person I got to to file an issue with, I could not understand them. I had ask 'what did you just say' many times. I laughed at myself for even thinking of saying 'can you slow down and speak clearer English please' - I mean, crazy... I was paying by the minute for the long distance at the time and it ended up being a 25 minute call that could of been 10 if I had a magic translate without accent device.

"a position of privilege because the culture allows you to mock others' accents"

- This is truly not about mocking accents, this is truly about my lack of ability to learn well.

Yes, I would definitely sound like an idiot trying to speak another language. Like I said, I do not learn as well as some others.

Truly not my intent to be rude. I apologize if the shortness came off that way, I was trying to be brief in the hope that there's a chance that some tech like this exists and someone here could point me to it. Before I posted, I DDG'ed it and found a couple of things attempting to be in that space with a 'speak to sales' type of 'you'll never afford this' button for info.

I will never be dismissive of anyone's command of English, or other spoken language, or computer language or anything like that. There is no way for me to know someone else's situation and circumstances led them to their current command of whatever language. If someone is trying to learn more at any age; I applaud and encourage them - being rude or dismissive does not encourage more learning.

"no accent am-english", when you could have just called it what it is — "an american accent". - Well maybe, but actually I meant to be more specific, as mentioned a bit above - I mean '"no accent" American accent' - because there are plenty 'American accent' types that I would want removed by a magic earpiece to make it easier for me to understand and learn.

keerthiko
1 replies
18h23m

I appreciate the thoughtful reply. I don't think you're rude, and I get what you're saying as someone who thinks a lot about accents and languages. However, I still think you missed my point.

There is no "no accent". An accent is a baseline feature of intelligible human speech, like a voice, or a volume, or a language. You can't say stuff without those features. When you say "the Chicago accent", or the "Midwest accent", that's an accent! Not "no accent".

I understand it's common usage to refer to the default "radio accent" as "no accent", but in a country like America, all kinds of people with all kinds of accents speak English. Reinforcing an expectation that a certain (usu. majority-white-spoken) one is the "default" by referring to it as "no accent", implicitly suggests all others are erroneous affectations, even if I trust that is not your personal intent.

All that said, I think your idea for a translation device capable of revocalizing what is said with an unfamiliar accent into one you are used to is not a bad one, and likely easier than translating between languages while retaining expressiveness.

kortilla
0 replies
14h40m

reinforcing an expectation that a certain (usu. majority-white-spoken)

Wow, you just keep digging in don’t you? When these Americans you deride say “no accent”, do you think they are referring to the “majority-white-spoken” Scottish accent?

No, of course not. Get that race baiting out of here.

zer00eyz
0 replies
17h3m

https://www.bbc.com/culture/article/20180207-how-americans-p...

What accent? Whose accent? Brits are as diverse accent-wise as Americans: London, Cockney, New England, Southern...

A lot of Indians that I know have a very "proper" British accent, one that is maybe a bit aristocratic; it's quite an irony for a former colony. https://www.bbc.com/future/article/20220915-what-the-queens-...

The context matters, but so does history.

lstamour
0 replies
17h38m

There is another way of looking at this, in the context of the parent post: we could suggest that any accent could be converted to “no accent” where American accents are converted to British, or where standard Japanese is converted to a Nagoya pronunciation. Whatever seems like your preference of “no accent”. With this interpretation of the parent post, it’s not specifically about any particular English accent. I’ve been told by others that I have an accent yet I think I don’t have one - and honestly, I think most people have either encountered this - having an accent when you think you don’t have one - or haven’t travelled enough! :)

And I mean, yes, there are people who know they don’t sound like whatever ideal accent they have in mind, and there are people who will make fun of accents - but, and I can’t stress this enough, depending on the context literally any accent can be made fun of, sadly. I’ve had people mock my “American” accent while travelling, for example. It sucks, but it’s not easy to single out any accent as “default” unless it’s literally enforced by a government and taught that way in schools. Last I checked, the US is not one of those countries and English is not as centrally controlled as e.g. French can be.

kortilla
0 replies
14h51m

This would carry some weight if you didn’t take an opportunity to take a shit on Americans’ English in the middle.

Dylan16807
0 replies
16h20m

In fact, you were so close — you called it a "no accent am-english", when you could have just called it what it is — "an american accent".

There are many american accents. Your suggestion makes the sentence much less clear.

And by specifying "american" they're already making it clear there is no such thing as a universal base accent for english.

ChuckMcM
0 replies
21h19m

This is a really interesting use case. I could definitely see this as a service for content providers to get more reach and I think you could justify a subscription price for the service based on this.

By creating and keeping speaker-specific tonal ranges and profiles, you maintain better cohesion in the final product.

scotty79
0 replies
18h7m

It would be really cool as an assistance in practicing correct pronunciation and accent. Hearing your voice saying it right and then hearing how you actually said it the last time you tried might help you to get both into alignment.

mbforbes
12 replies
16h8m

I worked on building exactly this earlier this year. I was hanging out in Taiwan for a few months and thought, surely the Babel Fish should exist by now.

I did several experiments recording from all the microphones I could on my iPhone and AirPods while out in the wild. My conclusion: it's impossible right now for that hardware given the microphones we have and what they pick up.

So much of what's spoken is at a combination of (a) high distance (b) low volume (c) background obscuration. Something that was clear as day to my ears would barely register on the mics. While context is of course an issue, the raw audio didn't have enough to even translate.

The one caveat is that there might be low-level (i.e., Apple-only) access to headphone microphones that capture the environment to do noise cancellation. I'm not sure though---I couldn't find them on any API.

For cases where you do have clear audio, existing apps (e.g., Google Translate) are so close to achieving this, but don't let you specify audio outputs with enough fine grained control. By default, it will start screaming out of your phone what you were attempting to silently translate.

godelski
6 replies
14h9m

There's also some magic to the Universal Translator and Babel Fish: they perform zero-shot real time translation.

That is, they are able to translate (in all directions) novel languages that were not previously heard[0]. It is an open question, with a likely negative answer, whether there is a universal grammar even among humans[1] (the definition itself is vague, but even the most abstract version is suspect and highly unlikely to be universal across species). I think no one will be surprised if it remains impossible to interpret an entire language based on only a few words (let alone do it in real time).

This isn't a knock, because even a trained device is insanely useful; it's just a note about limitations and triage. This is awesome stuff and I can't wait for the day we have translation headphones. It's an incredibly complex problem that I'm sure is not short of surprises.

[0] There are a few exceptions such as Star Trek TNG's episode Darmok, S5E2, where the Tamarians' language is unable to be translated due to its reliance on cultural references (the literal words are translated but the semantic meanings are not). It's a well known episode and if you hear anyone saying "Shaka, when the walls fell" (translates to "Failure") they are referencing this episode (often not using the language accurately but who cares (nerds. The answer is nerds)).

[1] https://en.wikipedia.org/wiki/Universal_grammar

geoelectric
5 replies
11h56m

Can’t speak for ST, but did they ever say the babel fish understood languages it never heard before? I thought the galaxy was just exceptionally well-cataloged, given the HHG itself, and humans were hardly unknown.

civilitty
4 replies
11h52m

The babel fish translated via brainwave energy and a telepathic matrix:

> The Babel fish is small, yellow and leech-like, and probably the oddest thing in the Universe. It feeds on brainwave energy received not from its own carrier but from those around it. It absorbs all unconscious mental frequencies from this brainwave energy to nourish itself with. It then excretes into the mind of its carrier a telepathic matrix formed by combining the conscious thought frequencies with the nerve signals picked up from the speech centres of the brain which has supplied them. The practical upshot of all this is that if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language. The speech patterns you actually hear decode the brainwave matrix which has been fed into your mind by your Babel fish.

jasomill
1 replies
11h6m

“Now it is such a bizarrely improbable coincidence that anything so mind-bogglingly useful could have evolved purely by chance that some thinkers have chosen to see it as a final and clinching proof of the nonexistence of God.

“The argument goes something like this: ‘I refuse to prove that I exist,’ says God, ‘for proof denies faith, and without faith I am nothing.’

“‘But,’ says Man, ‘the Babel fish is a dead giveaway, isn’t it? It could not have evolved by chance. It proves you exist, and so therefore, by your own arguments, you don’t. QED.’

“‘Oh dear,’ says God, ‘I hadn’t thought of that,’ and promptly vanishes in a puff of logic.

“‘Oh, that was easy,’ says Man, and for an encore goes on to prove that black is white and gets himself killed on the next zebra crossing.

“Most leading theologians claim that this argument is a load of dingo’s kidneys, but that didn’t stop Oolon Colluphid making a small fortune when he used it as the central theme of his best-selling book, Well That about Wraps It Up for God.

“Meanwhile, the poor Babel fish, by effectively removing all barriers to communication between different races and cultures, has caused more and bloodier wars than anything else in the history of creation.”

blooalien
0 replies
7h44m

I couldn't help but hear this in my mind as it was read in the voice of the narrator from the old BBC "Hitchhiker's Guide" mini-series.

passion__desire
0 replies
1h33m

I think the idea of the Babel Fish might encroach on computational complexity limits in some sense. Imagine a future "Theory of Everything" book written in an alien language. The book has a total of 1 million characters across its pages, where each character is distinct. Now the Babel Fish must be able to "translate" such a language to English given its oracle-like powers? Can it do the job?

geoelectric
0 replies
10h46m

Well, then. Magic indeed!

KPGv2
3 replies
13h23m

Also a lot of spoken language involves context that AI is nowhere near understanding yet, let alone all the cultural baggage necessary to accurately translate/localize a lot of utterances.

"Can you stand up?" would be translated differently into Japanese depending on whether you're implying you need them to move their butt off your cell phone versus directly inquiring as to the function of their legs after a car accident. If you speak English and hear it as a background without the rest of the context being picked up, your brain instinctively knows it can interpret it either way, no problem.

But if you're Japanese and the AI picks a specific way to translate it, then you are completely unaware of the ambiguity because the AI resolved it with a 50% chance of being wrong.

pxoe
2 replies
8h25m

"Can you stand up?" would be translated differently into Japanese depending on whether you're implying

nitpicky, but is it though? not really. and it's as much 'difference depending on what you're implying' as there would be in english comparing just saying 'can you stand up' or specifying 'from the seat/at all'.

resonious
1 replies
7h29m

Probably not the strongest example but there are definitely phrases that are specific in one language but ambiguous in another.

youngNed
0 replies
6h9m

There are certainly nuances, even when 'understood'

Google: "A bit sticky, things are pretty sticky down there."

civilitty
0 replies
11h53m

I'm on mobile so can't find the link but years ago there was a DARPA (iirc) program trying to solve this problem in the context of surveillance in a loud crowded room. Their conclusion was that there needed to be n+1 microphones in the room to be able to cleanly differentiate all of the noise, where n is the number of noise sources, which in their case was number of conversations going on in the room (assuming no other loud sources of noise like music).

I think it's totally doable but you'd need many more microphones in order to deal with real world noise. As MEMS microphone quality improves, this should eventually be possible with a combination of smartphone/headphone/some other device like something around your neck.

diob
8 replies
21h24m

The problem is you need a full sentence, plus surrounding sentences to properly translate a lot of things (aka context matters).

So no matter what, hearing the conversation in your own language would involve some delay for translation.

sexy_seedbox
2 replies
18h16m

So then we need something like neuralink to get the whole thought from one's brain first, then the sentences are processed properly for the context, then translated before the speech is delivered.

kabouseng
1 replies
13h0m

Most thoughts are in a language. There is no one underlying universal machine language for the brain.

plastic3169
0 replies
10h37m

Are most thoughts in language? This doesn’t reflect my experience. Language floats on top, but there are layers under there. You can also feel it when you end up thinking in another language. It does not go through the first one but is a thing of its own.

Pretty sure there is nothing universal there though as you say.

ItsMattyG
1 replies
19h53m

My understanding is that they trained a separate model to specifically estimate when they have enough context to begin translating, as a skilled translator would.

scherlock
0 replies
17h19m

My mom used to do English/French translation. Her favorite example was the word "file". That word has multiple translations in French depending on the context, and that context may simply be implied by who is speaking. You may not be able to figure it out based on the conversation alone.

DigiDigiorno
1 replies
19h3m

Even the native original version needs the proper context. Sometimes you need the entire sentence to figure out what the sentence was really about.

I'm reminded of Mark Twain complaining about verbs arriving at the very end of sentences in German (among a myriad of other complaints).

"The Awful German Language" - Mark Twain https://faculty.georgetown.edu/jod/texts/twain.german.html

scotty79
0 replies
18h14m

Sometimes you even need a second sentence or even a few to understand what the first sentence was about.

ChuckMcM
0 replies
21h6m

I think I could adapt to that. But it would be an interesting experiment.

dimitrios1
4 replies
21h52m

Another lesson we can learn from Sci-Fi is very often different species on a planet would have their tribal / local languages and dialects but all spoke a common tongue. I think this is the more humanizing approach, rather than delegate even more of our fleshly processing power to machines.

somewhereoutth
2 replies
21h32m

This seems to be what is happening in Europe (and perhaps more generally across the globe), with English being the common tongue.

Question is, what will happen to the tribal / local languages? Will they survive?

nemomarx
0 replies
17h49m

Historically, we've seen the larger languages build themselves up by intentionally stamping out the teaching / use of smaller local languages. France banned some regional languages from appearing on broadcast television for years, etc.

This might be required to get full buy-in for a unified language, which is a bit sad but makes some sense - if you ensure it's taking up more and more of media and culture, more people know it from immersion, and other languages are reduced to being spoken at home / with friends, which cuts into how many people are really fluent in them.

Cthulhu_
0 replies
19h43m

It varies. A lot of local languages have gone extinct already. There are linguists hard at work trying to document / record dying languages, but it won't be the same as living the language from childhood.

micromacrofoot
0 replies
20h41m

then of course, there's always Darmok and Jalad at Tanagra

baby
2 replies
21h39m

I’m wearing the Ray-Ban Meta right now and they are already mind-blowing; I can already talk to the Meta AI assistant seamlessly. I bet one of the future iterations will have exactly this.

figers
1 replies
20h24m

Curious, what do you ask it besides take a picture / video or what's the weather?

I have a pair and have only asked it that so far...

baby
0 replies
16h51m

Whenever I have a question that I used to pull up Bard/ChatGPT for, and I’m wearing my glasses.

Kind of like having an expert next to you all the time.

pokstad
0 replies
15h39m

I look forward to the day when that problem is solved by a company that doesn’t mine my data to sell ads.

TheHumanist
0 replies
22h5m

Babel Fish

999900000999
52 replies
1d

Can't wait for someone to roll a language tutor out with this tech.

Everyone gets a personal tutor for hours a day.

I would absolutely love a VR game where I just need to work in China or Mexico all day and pick up the language that way.

jahewson
11 replies
1d

Isn’t having the AI do it for you better than having the AI teach humans to do it?

dylan604
5 replies
1d

Sure, if you're not into personal growth. Not everyone wants to become the useless bit of lard sitting in a chair while a computer does everything for them. Yet. Some of us still like to do the actual things, but just need some assistance along the way. We still have a bit of time before we're all the humanoids from Wall-E

ericmcer
2 replies
22h37m

Yeah, that's why I mill my own grain and am getting into textiles.

djvdq
1 replies
21h39m

I love it when people use these pathetic extreme examples when they don't have any meaningful arguments.

ericmcer
0 replies
19h5m

That isn't an extreme example at all; people used to mill grain and make clothing by hand, and now we don't. We somehow are not sitting around getting fat even though technology takes care of those tasks.

The parent's suggestion is that if we don't have to learn languages, that will lead to us all lying down drinking Big Gulps while robot slaves take care of us. Their take is the extreme example. People have literally made this same suggestion about every technological advance and it never comes true.

TeMPOraL
1 replies
19h43m

We still have a bit of time before we're all the humanoids from Wall-E

Obligatory reminder that the movie itself explains that people are what they are not because of their lifestyle, but because of the time spent in low-gravity environment.

dylan604
0 replies
19h15m

not sure that really matters to the point

whoisburbansky
0 replies
1d

It depends on what your goal is; for some tasks it's possible that getting the AI to do it is best, but, e.g. the existence of auto-pilot doesn't mean that hobbyist pilots wouldn't benefit from/enjoy exercising the same skills manually.

swatcoder
0 replies
1d

Maybe prior to fluency, for something like an odd business or tourist trip.

But there's a point in language learning where you can come to express yourself directly in a new language without intermediary "thinking" in your first tongue. The communicative and expressive potential of that mode is much higher than trying to squeeze one's intent through any kind of translation, machine or internal.

Plus, you know, it's fun.

modeless
0 replies
1d

Even a perfect human translator following you around wouldn't be anywhere near as good as knowing the language yourself.

j33zusjuice
0 replies
1d

Not necessarily. It depends on the use case. For taking a vacation, having an AI that can instantly translate to your native language would be amazing. That’d solve a lot of real world problems, no doubt.

However, translation has a great deal of subjectivity embedded in it, particularly when there aren’t 1:1 translations. Case-in-point: there are many English translations of the Christian bible, all similar enough, but there are enormous variations in some cases. And there are at least as many branches of Christianity as there are English translations of the Bible. Some of them strictly recommend the same translation, and they still disagree on the meaning of various passages.

Besides the problems inherent to translation, learning another language gives you another paradigm of thinking. The words we use, the way we construct sentences, etc., all impact our view of the world. Here’s a paper that discusses the impact of the over-reliance on English in cognitive sciences, and how this has downstream effects: https://www.sciencedirect.com/science/article/pii/S136466132...

Learning languages as an adult also has protective benefits. It reduces the probability of Alzheimer’s (maybe dementia, overall?).

coldtea
0 replies
1d

In the way that watching porn is better than having sex.

modeless
9 replies
1d

This is what I'd like to build (the tutor part at least, not the VR game part yet). I'm planning to extend my current English only rough prototype[1] to support Mandarin. (I happen to be learning Mandarin myself at the moment, and there are a bunch of open source bilingual Mandarin LLMs and speech synthesizers from China to choose from.)

I think a lot of people are working on similar things right now. I know of one called http://yourteacher.ai

[1] https://apps.microsoft.com/detail/9NC624PBFGB7

siraben
8 replies
1d

Is there a high quality speech synthesizer (ideally local) for Mandarin you have found? There are some subtleties with tone sandhi rules and how they interact with prosody that I feel are lacking with current TTS voices I’ve tried.

gattr
3 replies
23h59m

I love the idea of LLMs being super-efficient language tutors. And you have a good point; coming soon: "We've been getting a lot of these tourists here lately, they're eerily fluent, but all seem to have the same minor speech impediment" (read: messed-up weights in a commonly used speech model).

bityard
1 replies
23h5m

all seem to have the same minor speech impediment

Ah, that is called an accent.

dontupvoteme
0 replies
21h49m

Kind of. Accents are typically derived from the intersection of natural languages, specifically which ones you learned the phonetics of first. (With the exception of the Mid-Atlantic accent...)

This would be something quite novel, as the speech irregularities would not have their origin in people.

I don't know what you would call it, but it needs at least some adjective before "accent" to differentiate it, IMO.

siraben
0 replies
23h6m

I've been using ChatGPT 4 to translate and explain various texts in Mandarin and it's been very on point (checking with native speakers from time to time, or internet searches). As expected, it has trouble with slang and cross-language loanwords from time to time. However, for languages with much less information online, it hallucinates like crazy.

coming soon: "We've been getting a lot of these tourists here lately, they're eerily fluent, but all seem to have the same minor speech impediment"

Haha, if that were to pass, that would still be a far better outcome than our current situation of completely blind machine translation (especially for various Asian languages that are very sensitive to phrasing) and mispronunciation by non-native speakers.

modeless
2 replies
1d

The first one I plan to try is https://github.com/netease-youdao/EmotiVoice

I don't have the expertise to judge the quality of Mandarin pronunciation myself, being a beginner. But it sounds OK in English and it's made by native Mandarin speakers in China so I expect that it sounds better in Mandarin than English.

siraben
1 replies
18h30m

Sounds pretty good, although still lacking in natural-sounding tone sandhi (e.g. try 一下, it should be yi2xia4 instead of yi1xia4).

999900000999
0 replies
13h40m

Do you have a favorite Chinese learning app ?

rnjesus
0 replies
20h9m

the azure neural tts voices in chinese are the best i’ve heard, specifically the “xiaochen” voice. i use it in anki daily to generate sentences for my mandarin decks with an api key/plugin. it’s not something you run locally of course, but they have a decent enough free tier.

i’m hoping a voice as realistic as this becomes a local app soon, but i’ve not found anything that’s nearly as natural sounding yet. (also, honorable mention to chatgpt’s “sky.” she pronounces mandarin with a funnily american accent, but it sounds natural and not as robotic as the open-source alternatives i’ve tried)
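
for reference, generating a clip with that voice looks roughly like this with the Azure Speech SDK for Python (azure-cognitiveservices-speech); the key, region, and output filename below are placeholders, not my actual setup:

    # Sketch: synthesize a Mandarin sentence with the "xiaochen" neural voice.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY",
                                           region="YOUR_REGION")
    speech_config.speech_synthesis_voice_name = "zh-CN-XiaochenNeural"

    # write the audio to a wav file that an Anki plugin (or anything else) can use
    audio_config = speechsdk.audio.AudioOutputConfig(filename="sentence.wav")
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
                                              audio_config=audio_config)

    result = synthesizer.speak_text_async("我们一起去吃早午餐吧。").get()
    print(result.reason)  # expect SynthesizingAudioCompleted on success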

meowtimemania
7 replies
1d

There are already a few of them. Check out https://hallo.ai

999900000999
6 replies
1d

I wouldn't feel good about anything that's not focused on a single language.

You end up with the Duolingo problem where you know to say the names of 20 different fruits but not how to introduce yourself.

coldtea
1 replies
1d

Never seen that in Duolingo. It starts with the basics and phrases, not random useless vocabulary.

cptskippy
0 replies
23h16m

I was going to Italy and started using Duolingo to try and help. I learned such useful phrases as "the children have bread".

numpad0
0 replies
23h10m

(The Duolingo problem, as I understand it: Duolingo is designed around the premise that, by exposing your subconscious to a small set of words and phrases in the target language, your brain should be able to trivially construct output shims from Universal Grammar, which must exist, to the desired language; but that doesn't work in practice, and you end up with just the small set of words and phrases your subconscious has recorded.)

massimokris
0 replies
21h10m

Duolingo's problem is not that they have a bunch of languages; it's that achieving fluency in a target language is about being able to produce/generate phrases, and they just have you consume and sort words and phrases. With any AI language tutor, the student must produce phrases in order to practice, and that moves them along the path to fluency.

gs17
0 replies
23h20m

Duo has a different problem for me. The lack of focus means some languages don't get features. Chinese still doesn't have Stories (there's an unofficial version of it, but we've been waiting years).

apwell23
0 replies
1d

You end up with the Duolingo problem where you know to say the names of 20 different fruits but not how to introduce yourself.

Not sure if this is a Duolingo problem. There are modules in Duolingo specifically for saying your name. I think it's the travel module.

spaceywilly
4 replies
1d

To me the key functionally for any language learning app is giving you feedback on your pronunciation and general understanding. I’ve been using Duolingo to learn Mandarin and when I try to speak to anyone it’s difficult for them to understand me, because my pronunciation is all wrong. The app is just feeding info to me one way, and I can try my best to recreate what I’m hearing, but there’s no way to know if I’m messing it up. They do have a speaking feature but it doesn’t work very well, certainly not to the same level as speaking with a real person who is fluent in the language and having them correct you.

throwaway4aday
0 replies
23h8m

As a quick solution, you should try recording yourself speaking and then listen to it to check your pronunciation against some reference. So for example, find a YouTube video in the language you're learning that also has good subtitles (use https://filmot.com/ ) and listen to how they say the phrase and then record yourself saying the same phrase and play it back and compare.

kccqzy
0 replies
10h57m

It's the same struggle language learners have faced for a long time, app or no app. I studied French grammar carefully, read French books, listened to audio tapes in French, and I still got pronunciation wrong often.

I resisted using Duolingo because I knew their speaking feature sucks. But the only reason I need an app rather than books or audio tapes is that I need something to correct my pronunciation.

dog321
0 replies
19h40m

I practiced for a long time using the below pronunciation trainer and I get a ton of compliments from native speakers on how accurate my pronunciation is.

https://fluent-forever.com/product/fluent-forever-pronunciat...

addandsubtract
0 replies
7h51m

There are other language learning apps, such as Busuu, which make you record and peer-review other people's pronunciations.

bilsbie
3 replies
1d

I think it would be so ironic if advanced AI ended up simply teaching us new languages quickly instead of translating for us.

toomuchtodo
1 replies
23h50m

Might be able to generate a better language than what we have.

bilsbie
0 replies
22h23m

Good point. Maybe they invent a better language and easily teach it to everyone.

dontupvoteme
0 replies
21h49m

Finally Esperanto has a use case!

advaith08
2 replies
1d

seen a lot of these, but none for Indian languages. Would love to try an Indian language one!

999900000999
1 replies
23h16m

Are Indian languages hard for English speakers?

thinkingtoilet
0 replies
20h28m

I'm learning Hindi and there are some things that are easy (phonetic alphabet, nothing like 7 different sounds for 'ough'), but the sentence structure is very different and can be hard to get right. Pronunciation isn't too bad for the most part, but there are a few tricky things, for example four different 't' sounds and four different 'd' sounds. The hardest part is that there really aren't that many resources. Even though Hindi is the third most spoken language in the world, you will find far more resources for many of the less spoken European languages.

flanbiscuit
1 replies
21h19m

I would love a game that helped you learn a language (not necessarily VR though as I don't have that equipment). The game drops you into a world (a country of the language the game is meant to teach you) where no one speaks your language and you have to figure out what people are saying in order to fulfill quests. You get some hints, like maybe you have a simple translation guide in your inventory or sometimes you meet people who can speak a few words of your language. That would motivate me to learn faster than self-taught tutorials.

I'd love to learn French and the game would take place in locations all around modern France.

It would have to have a good story. Maybe something in the style of the Professor Layton series could be interesting, or something more open world.

resonious
0 replies
2h41m

If Professor Layton itself has a French translation then you're more than half of the way there! Existing games are already quite good for language learning. But indeed they're missing the "realistic" element that you're after.

zbyforgotp
0 replies
19h0m

But will people use them?

tmountain
0 replies
1d

Started a project to do this a while back. It's pretty fleshed out:

https://www.parcero.ai/

I could integrate this instead of Polly pretty easily.

massimokris
0 replies
21h19m

I built one for people in Latam to practice languages in a conversational way through a WhatsApp chat https://wa.me/+5491162951713?text=hola%20Speakeasy

jbird11
0 replies
22h45m

Absolutely, what I've noticed is that the current apps are great for beginners but after a certain point the only way to improve your ability to speak a new language is to well... speak it. I built Proseable to help people move beyond the generic how to order a coffee or ask to go to the bathroom, and have more meaningful conversations in the real world. Check it out!

https://www.proseable.com/

inbread
0 replies
22h56m

I built just this a month ago with the Azure AI speech API, which is already pretty good at multilingual speech.

https://github.com/adrianmfi/gpt-tutor

I look forward to testing if switching to Seamless can improve it further, Seamless supporting nearly 100 languages is a nice improvement.

dwighttk
0 replies
20h28m

and the language tutor company could have you pilot around a menial labor droid while you are learning...

dontupvoteme
0 replies
22h8m

For Language Acquisition, Input Is All You Need. (Mostly)

What would be really cool is something that can autodub videos or audio into your target language. The hardest problem learning languages that aren't English is often finding content to consume in them.

Disclaimer : I am Krashenist so this take is biased

Jeff_Brown
0 replies
22h10m

game

Yes! Better yet, you're a spy, or a hostage negotiator, or the leader of any kind of enterprise (army, business, aid organization) ...

Programming games like that will resemble directing improv theater. You can't program every response; you'll have to instead fit each character with beliefs and motivations.

I can hardly wait.

navbaker
34 replies
1d

Seamless Streaming looks really promising! We just had a new employee start a few months back with profound hearing loss, and our company had no idea what to do with him from an accessibility standpoint. They floated solutions like Dragon, not realizing those solutions are not real-time.

He ended up rolling his own solution by standing up Whisper in one of our clusters and writing a basic front end and API to take his laptop’s mic input and chunk it every few seconds to send to the model and get back text in pseudo-realtime. We got him a pretty beefy Alienware so he wouldn’t be tied to the cluster GPUs. I can’t wait to see what he does with these new models!
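
For anyone wanting to build something similar, here's a rough sketch of that kind of chunked, pseudo-realtime client in Python. The server URL, the /transcribe endpoint, and the chunk length are hypothetical placeholders, not his actual setup:

    # Sketch: capture mic audio in short chunks and POST each chunk to a
    # Whisper-backed transcription endpoint, printing text as it comes back.
    import io
    import wave

    import requests
    import sounddevice as sd

    SAMPLE_RATE = 16000    # Whisper models expect 16 kHz mono audio
    CHUNK_SECONDS = 5      # latency vs. context trade-off
    SERVER_URL = "http://whisper-host:8000/transcribe"  # placeholder endpoint

    while True:
        # record one chunk from the default microphone (blocks until done)
        audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="int16")
        sd.wait()

        # wrap the raw samples in an in-memory WAV container
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)          # 16-bit samples
            wav.setframerate(SAMPLE_RATE)
            wav.writeframes(audio.tobytes())

        # ship the chunk off and print whatever text the server returns
        resp = requests.post(SERVER_URL,
                             files={"file": ("chunk.wav", buf.getvalue())})
        print(resp.json().get("text", ""), flush=True)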

cgb223
10 replies
23h39m

Just wanted to say you’re a great employer to be so incredibly accommodating to the point you get them an Alienware and let them roll an accessibility solution

We need more support for employees like this!

cced
5 replies
23h37m

Second this!

Also, what about Apple’s latest M3 series chips? Are these in the same realm as Alienware in terms of AI compute?

jackson1442
2 replies
23h25m

I think generally the consensus of Apple Silicon is that they're great _for a laptop_, but still aren't going to beat a dedicated graphics card + high-end CPU like i9/Ryzen 9. Biggest thing going for apple is the performance/watt though which is critical for a laptop.

cjbprime
1 replies
22h38m

I think this is missing the main reason to use Apple Silicon, which is that your dedicated graphics card probably has 24GB or less of RAM, whereas e.g. an M2 Ultra Mac Studio can have 192GB of RAM with a far superior memory bandwidth to anything on x86. This is important because even a "small" LLM like Llama2 13B would require quantization to fit in the 24GB RAM that the dedicated graphics card will give you, whereas the Mac could run Llama2 70B without quantization (at FP16).

aftbit
0 replies
22h2m

Whisper doesn't need that much RAM though.

willy_k
0 replies
23h17m

They definitely are in terms of energy efficiency

nodja
0 replies
22h55m

They're better than most consumer x86 CPUs but worse than using a GPU. Where they shine is when the ML model can't fit in the GPU's VRAM, since you have better options for RAM size with Macs.

romwell
3 replies
22h26m

Just wanted to say you’re a great employer to be so incredibly accommodating to the point you get them an Alienware

So gracious, to give a software developer some hardware to run the software they need to work, that costs a whopping nothing more than what other people in the industry get on the average.

and let them roll an accessibility solution

"You're such a good employer! You let your employee build their own accessibility ramp to the back entrance in their own time, and even got them a mortar spatula to do so!" We need more support for employees like this!

We need more support for employees like this!

And less support for employers like this.

Solvency
2 replies
21h3m

Not sure why you're being downvoted. Literally the equivalent of building your own ramp.

freedomben
1 replies
19h23m

I didn't downvote, but I considered doing so because nowhere that I saw in GP does it say in his own time, and that's a critical piece of the equation. Hallucinating that datum means they got the argument wrong, and worse they were harshly critical of the company based on that wrongly assumed information.

It reminds me of the Homer Simpson quote, "I don’t mind being called a liar when I’m lying, or about to lie, or just finished lying, but NOT WHEN I’M TELLING THE TRUTH!" I would be equally critical if it was warranted, but when it isn't it's deeply unfair to the accused.

If the person wanted to build their own ramp, and the employer let them do it on the clock, that's a completely different scenario than the employee having to come in during their off-hours to build the ramp just so they can go to work.

navbaker
0 replies
14h33m

Yeah, it wasn’t on his own time. He had a full budget and this was right in line with stuff he had already done research in anyway, so he just went for it.

qkeast
9 replies
23h31m

Awesome! I love hearing about places making the effort to be inclusive.

As someone who’s profoundly deaf myself, another less technical approach is to install Rogue Amoeba’s Loopback, and use it to pipe audio from a given app into a tool like Google Meet or Otter.ai using the Loopback device as the audio source. This effectively provides real time captions for anything running on your existing machine.

romwell
3 replies
22h23m

Awesome! I love hearing about places making the effort to be inclusive.

The extent of the effort being getting their employee a slightly-more-expensive-than-average tool that would enable them to do their job better regardless of the disability?

Such inclusive, much pat-yourself-on-the-back, wow.

"We gave our woodworking shop employee a quality saw so that they'd make their own accessibility ramps!"

qkeast
0 replies
21h44m

I have literally been told in job interviews that the company would not be “allowed” to hire me because I’m hearing impaired, so yes, making an effort to support an employee’s disability and their needs is worth recognizing.

callalex
0 replies
21h45m

What would you have them do instead?

RogerL
0 replies
19h41m

So what? Okay, in the case of a ramp, if you need one you probably are going to have difficulty building one. So pay employee Sally to build it instead, absolutely.

But hearing loss does not impair standing up servers and software. They can pay the employee who probably is the expert at this, the guy with the hearing loss, or go task Emil to go do it to ... avoid 'appearances'?

navbaker
1 replies
14h28m

We definitely explored using these tools, but we’re constrained by government sponsor rules regarding data protection in our day to day work. We can use ZoomGov captions, but most of the other tools weren’t approved. It looks like Windows 11 has a real time solution of some kind, but we’re still stuck on 10.

sdrothrock
0 replies
10h26m

I have profound hearing loss and rely on the Windows 11 captions. They are absolutely best in class and head/shoulders above any other automatic captioning I've used. My coworkers have a variety of accents (Hispanic, Eastern European, South African) and it does a great job with all of them.

Additionally, it supports multiple languages (only one at a time sadly), so I also use it for Japanese captions and it's equally great there.

tuukkah
0 replies
22h54m

Clever use of Google Meet as a tool! Also, Google Pixel phones now provide realtime captions to any speech playing on the phone (Accessibility > Live Caption). You can also choose a "preferred language" and the captions will be automatically translated to that language from other languages.

jallmann
0 replies
22h37m

Google Chrome [1] also has captioning built in [2], so this could also work from a plain page that hooks into the loopback device. Pretty sure it's using the same speech-to-text backend that Google Meet uses.

The nice thing about Chrome feature is you can move the caption box around and keep it in the foreground while doing other things, although styling options seem limited (the text might be a little small for some).

[1] on desktop, not sure about mobile

[2] via chrome://settings/accessibility -> Live Caption

fy20
0 replies
8h10m

Whisper is pretty good for speech to text, and can be run in a resource-constrained environment. I tried a demo running in a browser using WASM on my phone and even the tiny model is not bad.

pawelduda
2 replies
22h50m

That's very nice of you

romwell
1 replies
22h22m

He ended up rolling his own solution

That's very nice of you

...doesn't compute.

What exactly was nice here?

diab0lic
0 replies
18h36m

We got him a pretty beefy Alienware so he wouldn’t be tied to the cluster GPUs.

Probably this.

FloatArtifact
2 replies
21h56m

The problem with Whisper is that it's not really optimized for command recognition versus general dictation.

- Whisper processes 30-second audio chunks. So if you process 5 seconds of audio, you have to pad it out with 25 seconds of silence. Hence a loss of efficiency, with CPU/GPU cycles wasted on 25 seconds of padding per chunk in the case above.

- Whisper most likely can't handle hundreds of commands, much less a thousand, performantly.

- Whisper doesn't handle short commands very well, nor does it post-process commands out of free dictation utterances with a good degree of accuracy.

Command dictation should be weighted higher than general dictation when decoding.

I work with a little under 1,500 commands in Dragon NaturallySpeaking. DNS is hot garbage as a program, despite having the best accuracy to date and the ability to handle commands and dictation in one utterance. You get to pay $750 for the privilege.

I've yet to see a free and open source speech recognition engine that can handle both dictation and commands with a high degree of accuracy.

Please please let me know if there's alternatives out there. I would definitely pay to support an open source project like this that focuses on command and dictation.

Most open source solutions nowadays focus so much on IoT command recognition with intents. That's not well suited for controlling your computer with grammars containing voice commands.

novok
1 replies
20h48m

Is 30s the input size set by the model, or programs that wrap the model? Is it how it's trained?

bakkoting
0 replies
18h50m

It's a property of the model itself.

Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder.

https://openai.com/research/whisper
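
That 30-second framing shows up directly in the openai-whisper package's lower-level API; a minimal example, essentially the snippet from the project README:

    import whisper

    model = whisper.load_model("base")

    # load audio, then pad with silence or trim so it is exactly 30 seconds
    audio = whisper.load_audio("clip.wav")
    audio = whisper.pad_or_trim(audio)

    # compute the log-Mel spectrogram the encoder consumes
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # decode that single 30-second window into text
    result = whisper.decode(model, mel, whisper.DecodingOptions())
    print(result.text)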

lovich
1 replies
22h41m

Y’all should turn that into a product, or at least open source it and get the positive PR + helping others

FloatArtifact
0 replies
21h6m

Y’all should turn that into a product, or at least open source it and get the positive PR + helping others

There you go. https://github.com/dictation-toolbox/dragonfly

kylixz
1 replies
22h22m

I recommend checking out: https://talonvoice.com/

FloatArtifact
0 replies
21h23m

It's not open source nor does the author intend to open the stack.

aftbit
1 replies
22h2m

Check out Willow! It does essentially this, using WebRTC. It doesn't handle the near-real-time response yet, but it does stream the audio to the server and the change would be pretty minor.

FloatArtifact
0 replies
21h11m

Check out Willow! It does essentially this, using WebRTC. It doesn't handle the near-real-time response yet, but it does stream the audio to the server and the change would be pretty minor.

Simple voice-to-text is not what's needed for dictating commands. It might be useful if I could load commands on the fly and decode utterances against them.

The client would need to be able to send its commands to the server on the fly.

sagz
0 replies
20h30m

Do they need realtime transcription?

Computer: webcaptioner.com
Android: Live Transcribe (g.co/livetranscribe)
iOS: Live Caption with the 'mic' icon enabled.

Web conferencing: Meet, Zoom, Teams all support realtime CC, which is pretty good.

londons_explore
28 replies
1d

Does "reduce toxic words" and "promoting safer communication" mean that if you say something wrong about LGBTQIA+ people it will 'correct' what you say?

I'm not sure I want the latest twitter trend to be involved in the design of my translator...

mortimerp9
13 replies
23h49m

Hi, I work on Seamless. What this refers to is added toxicity mitigation. We try to detect the level of toxicity in the input and make sure that the output toxicity level is not higher. This protects the model from making egregious errors in the translation.

There are more details in the paper, and the mitigation code is all open source if you want to check what it actually does.
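
As a toy sketch of the core idea only (this is not our actual implementation, which uses curated per-language toxicity word lists and searches among candidate translations):

    # Toy illustration of the idea (not the Seamless mitigation code):
    # never emit a translation that is more toxic than its source.
    def toxicity(text: str, toxic_words: set[str]) -> int:
        """Count words from a per-language toxicity list appearing in text."""
        return sum(1 for w in text.lower().split() if w in toxic_words)

    def pick_translation(source: str, candidates: list[str],
                         src_list: set[str], tgt_list: set[str]) -> str:
        src_tox = toxicity(source, src_list)
        # keep candidates that add no toxicity relative to the source
        safe = [c for c in candidates if toxicity(c, tgt_list) <= src_tox]
        # fall back to the top-ranked candidate if nothing qualifies
        return safe[0] if safe else candidates[0]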

Reubend
4 replies
23h30m

That's an awesome feature. I think one of the worst possible outcomes of machine translation is something that ends up being accidentally offensive, and this is a smart way to mitigate that.

SoftTalker
2 replies
20h46m

Or maybe we'll finally come around to the idea that being offended by words doesn't make a lot of sense.

madeofpalk
0 replies
1h6m

I'm sure you can understand why translating "I love you" to "I love you, bitch" is probably undesirable.

hiatus
0 replies
6h12m

This will happen at the same time we stop being uplifted by words, or moved by them, or brought to tears by them, or fall in love over them.

fl7305
0 replies
22h23m

one of the worst possible outcomes of machine translation is something that ends up being accidentally offensive

The Hitchhiker's Guide To The Galaxy claims the opposite:

"Meanwhile, the poor Babel fish, by effectively removing all barriers to communication between different races and cultures, has caused more and bloodier wars than anything else in the history of creation."

dontupvoteme
2 replies
21h19m

How do you account for colloquial (non-English) language which could be naively misconstrued as toxic?

e.g. "geil" (either cool or horny depending on usage) in German

It's not fundamentally different than e.g. "wicked" in English, but the biggest bias that potentially all these ML models exhibit is predisposition towards Anglophoneism

mortimerp9
1 replies
20h46m

Our goal is to have good recall, sometimes to the detriment of precision, so a word with multiple meanings might be considered toxic even when, in the actual context it is used in, it is not. The toxicity mitigation algorithm will search for alternative translations that have the correct meaning but not the potentially toxic word, so that there is no added toxicity in the output. This means that sometimes the model might prefer a less colloquial phrasing than what a human would use.

You can find details on how the multi-language toxicity lists were created in section 7.3 of the NLLB paper: https://arxiv.org/pdf/2207.04672.pdf. TLDR: it's not just a translation of a base English list, even if we started from that; each language has a curated list that was built by professional translators.

dontupvoteme
0 replies
20h8m

That's significantly less myopic than I pessimistically assumed. Thanks!

novok
1 replies
20h45m

Is there an ability to turn it off? If you're translating an R rated movie with criminals who swear a lot, is it possible to get non-toxic filtered output to make sure it's being translated properly?

mortimerp9
0 replies
20h39m

It only kicks in if the output is more "toxic" than the input. If the input has a lot of swear words and the output has the same amount, then it will be left alone.

Domenic_S
1 replies
23h41m

What this refers to is added toxicity mitigation.

Oh, well that clears it up! </snark>

I don't see any definition of 'toxicity' on the landing page - it seems to be one of those 'I know it when I (hear) it' kind of words... unless there's some widely-accepted definition in this area of study?

mortimerp9
0 replies
20h41m

Sorry if I wasn't clear, internally we've been talking about it a lot, but I forgot that it doesn't have such a solid definition outside of our work. Thankfully, we try to define it in section 7.3 of the NLLB paper: https://arxiv.org/pdf/2207.04672.pdf

The tl;dr is that if you say "Thank you for this job offer.", you wouldn't want it to be (mis)translated as "Go F*k yourself." But if you do say "Go F yourself", you still want it to be translated as that.

thomastjeffery
0 replies
12h10m

What about the inverse?

Can it make sure that the output toxicity level is not lower than the input?

If not (which I strongly suspect is the case), then that is unacceptable. We cannot fight toxic narratives with ignorance.

jadbox
4 replies
23h28m

Your comment seems to imply LGBTQIA+ is just a Twitter trend, versus people's lived experience and lifelong identity. This is as unnecessarily judgmental as small identities claiming that straight people must self-identify as cis.

There is no moral superiority to deny or force label other people's identities. You're an attack helicopter? Great, roger dodger, let's go get coffee Seahawk.

No one is seriously asking for litter boxes in school bathrooms or helicopter refueling stations.

mpalmer
2 replies
23h11m

No one is seriously asking for litter boxes in school bathrooms or helicopter refueling stations.

This feels a bit out-of-nowhere.

My read on parent comment was that "Twitter trends" are fast-changing norms about what language is (un)acceptable. They were not saying that LGBTQIA+ identity itself is a trend.

jadbox
1 replies
23h2m

Perhaps so. In light of yesterday's announcement from Russia labeling the "international LGBT public movement" as extremist, I think we should be careful about what we label as fads or (worse) insidious activity. Source: https://www.themoscowtimes.com/2023/11/30/russia-bans-intern...

mpalmer
0 replies
22h35m

You seem to me to be arguing against points no one is making. You're taking the word "trend" and extrapolating it to "fad" and "insidious activity" - both of which have very different meanings and connotations to the phrase "Twitter trend".

The original comment you replied to made the point that they don't want their own personal expression curtailed or modified according to someone else's opinion of acceptable speech.

As someone who repudiates Russia's policies, I support and agree with their point.

thomastjeffery
0 replies
12h15m

Thinking more directly about the subject in hand, what if we took their comment text as an example, and input it into a "responsible translation model"?

Taking what they wrote as harshly as possible, a translation model's output might include narrative elements from the transphobic judgement you are concerned about. That would be a problem, because it would amplify transphobic narratives.

Taking what they wrote as favorably as possible, a translation model's output might rephrase what was written, such that a pro-LGBTQIA+ inclusion narrative is more eloquently expressed than the author actually intended. That would be a problem, because hiding the reality of transphobic narratives would remove our ability to recognize and talk about them.

To make this even more complicated, what if we are using this model for real-time dialogue? What happens when someone says something vaguely transphobic, their words get translated to an inclusive narrative, and you continue that inclusive narrative in your reply? Should the translator alter your words to be transphobic? If it doesn't, then will the entire conversation go off the rails, or will both parties continue, oblivious of each others' ideological subtleties?

---

I don't believe for a second that a model could be trained to avoid toxic narrative and translate accurately.

Hallucination is a feature, not a limitation. The sooner "AI" narratives can accept this reality, the better.

beardicus
3 replies
23h46m

the site makes it pretty clear in multiple places that they're talking about "added" or "hallucinated" toxicity. maybe your culture war outrage is misplaced?

Domenic_S
2 replies
23h37m

Ok so I know nothing about how this works. It seems like if the model was able to properly detect words in the first place, it would never hallucinate 'toxicity'; if it can't recognize the word with high probability, how will it know whether the speaker actually said $toxicWord or whether it should print something else?

Perhaps it's taking a Big List of Naughty Words and weighting them so that the system must be "extra sure" that's what the speaker said, or else fall back to a G-rated word?

numpad0
0 replies
22h42m

Maybe it's for preventing unwarranted fucks[1]? Translation is more than just concatenating dictionary definitions, and machine translations routinely make this kind of out-of-place and technically correct lookups.

1: https://www.google.com/search?q=engrish+fucking+sign&tbm=isc...

mortimerp9
0 replies
20h36m

Meta employee here. The system is not perfect, or it would not "hallucinate". While it's pretty good, it does sometimes make errors (not just hallucinations, but also mistranslations due to noise in the training data). What we want is to avoid these errors introducing toxicity (think swear words) that wasn't in the input, as this could be very bad for the user. There is a separate system that double-checks the output (compared to the input) and tells the translation model to try again if it's too bad.

jwineinger
2 replies
23h55m

Their video said it was to reduce toxic word hallucinations, which does seem admirable/useful. I'm testing real-time translation in a church setting, and I've witnessed whisper hallucinating profanity, which is quite undesirable.

kelseyfrog
0 replies
23h32m

It also happens to be quite hilarious.

cgb223
0 replies
23h38m

“Toxic word hallucination” would be a great punk rock band name

sjbase
0 replies
22h54m

Please don't use Hacker News for political or ideological battle. That tramples curiosity.

From the hackernews guidelines

madeofpalk
0 replies
23h37m

Your framing of basic respect as being a "Twitter trend" is... bizarre.

nickreese
19 replies
1d1h

My wife was training to be a professional voice actor to do dubbing in several languages when we met.

I told her then that the industry would be disrupted by AI before she retired.

Glad she pivoted. Really impressive results.

0_____0
12 replies
1d

It won't replace high-end talent; I don't think models will be able to replicate the nuance for a long time. However, the entire low-to-mid end of the market is going to get nuked from low Earth orbit.

crakenzak
5 replies
23h48m

It will absolutely replace high-end talent. Anything that a human can do will be able to be done 10x better by a model -- especially in such a narrow and well defined domain.

sushisource
4 replies
23h38m

Did you hear the output examples? Yeah, I think not. It's definitely on the way, but if you need quality acting in your dub, there's no way you're going with this.

ygjb
0 replies
21h57m

These are models specially tuned and sized for near real-time, instant translation. It would be naive to think that there aren't technical creatives building and training models tuned for expressiveness and nuance in a more controlled environment.

dvngnt_
0 replies
21h42m

i think the key word is will.

a few more years of improvements if they happen could be disruptive

dontupvoteme
0 replies
21h35m

That's what they gave us plebs. To think they don't have a superior one they can sell...

crakenzak
0 replies
21h56m

Maybe not in the current state of the model, but judging by the rate of improvement we’re all seeing it’s just a matter of time (and data+compute+research obv).

chrismorgan
2 replies
21h32m

It won’t replace it, but it’s very likely to supplant it, just about destroying the segment by reducing demand by being good enough and so much cheaper, especially as people get more used to it.

Typesetting. Music engraving. Bookbinding. The quality of all these fields has been materially harmed by advancements.

Computer typesetting has, by and large, been a significant regression, though the gap has largely been made up now if you make the right choices.

Published music scores used to be set by experts. Now they’re set by novices using software that is mechanical in method and generally quite insipid. Most are atrocious compared to the old masters, and mediocre at best compared to the typical published scores from a hundred years ago; and very few popular scores are really good (… and if they are, there’s a reasonably high chance they’ve used GNU LilyPond, which has focused on this problem). But the barrier for entry is so much lower, and people have got used to the inferior results, so I don’t know if anyone engraves music the old way, and even people that know better largely just shrug and make do with the new. Like with computer typesetting, there is hope because things have slowly improved. But most will continue to be mediocre.

Books used to be bound with cold glue. It takes time to set, but the results are very good, supple and long-lasting. Then along came hot-melt glue, and it’s just so much friendlier for cheap manufacturing because books are finished within a few minutes instead of a day or two, that I don’t think anyone produces books the old way any more, even though the results are abysmal in comparison (compare the binding and reading experience of a paperback from the ’40s or ’50s with one from the turn of the century; no one after tasting the old will desire the new; for he says, the old is good). But they’re just (barely) good enough. Unlike the other two, I don’t think there’s any hope here—the regressive advancement crowded out the superior but dearer option so that no place was found for it.

pclmulqdq
1 replies
19h56m

You can still get relatively good published music scores from a few of the old German shops (Schirmer, Henle, etc.), but they are very expensive. They are a joy to use when playing, though, since the music is very clearly laid out and page turns are in the perfect place, etc. Finale and Sibelius are controllable enough that you can use them to do fantastic layout, but many people either do not understand how to make a score readable or don't care enough.

TeMPOraL
0 replies
19h33m

That, and what GP describes, is what I see as the overall trend of the market to hollow out the middle. It's not just about technology (though it plays a big role); it's all the optimization that comes from competitive pressure: materials, processes, business models, marketing.

What seems to universally happen is that the market bifurcates: one part is in a race to the bottom, the other (much smaller) aims for the super premium tier (overpriced quality), because only those two positions are sustainable once the race-to-the-bottom side drags all the economies of scale with it. So as a consumer, you get to choose between cheap low-quality garbage that's barely fit for purpose, and rare, super-expensive, professional/elite high-end products. There is no option for "good value for a reasonable price".

This has been happening to everything - software, furniture, construction, electronics, vehicles, food, you name it.

Shish2k
1 replies
1d

I wonder which will happen first - AI evolves to work well at the high-end, or high-end humans retire and there’s nobody left in the low-to-mid end to fill their shoes…

callalex
0 replies
21h41m

Given the modern trend of on-screen actors doing voice work, I think there will be a supply of talent for at least a few more generations.

RowanH
0 replies
21h31m

I'm using AI for training videos for my startup. Never going back to voice actors outside of primary marketing videos. The sheer convenience of the write/listen/tweak cycle on scripts is insane. In minutes you can do a voiceover that would previously have taken hours of work plus days of delay.

Sure, the final result sounds slightly robotic. 99% of people wouldn't care, and you can get more training videos done, faster, for a fraction of the cost.

[Edit] And I'll add that the difference from 6 months ago to today is noticeable. I imagine every 6 months we can just re-download updated voiceovers, and each time they'll sound just slightly more polished.

ggregoire
3 replies
23h16m

I told her then that the industry would be disrupted by AI before she retired.

Yes. I just discovered there is a text-to-speech addon [1] (now a few months old) for World of Warcraft that adds voices for every NPC in the game... It is so impressive and such a game changer (pun intended) that I naively asked in the chat of the Twitch stream I was watching "when did Blizzard add voices to the NPCs??". For an instant I really thought Blizzard had contracted actors, but no, someone like you and me just used AI to generate realistic voices for every character in the game. I don't think it's ready yet to completely replace actors in video games (surely it will in the near future tho), but voice acting is so expensive to do that I can see studios and developers in 2024 already using this tech for all the optional dialogues and secondary characters' voices.

[1] https://www.curseforge.com/wow/addons/voiceover

freedomben
1 replies
19h17m

I've wondered at what point this would happen. I think it could now, but from what I've read the voice actor unions are able to prevent it currently (at least for AAA games or non-indie devs). Many of them have agreements/contracts in place for the foreseeable future, and being the first big company to replace them would bring a heap of terrible press that nobody is going to want to touch. I think it's the same reason Hollywood reached the AI agreement recently too.

GaggiX
0 replies
10h30m

I think that there are a lot of voice actors that are not unionized tho. And games like The Finals already use AI for many voices.

lyu07282
0 replies
20h37m

Another recent example: The Finals uses AI voice generation for real-time game announcements.

https://youtu.be/kZ87wiHps9s

ilaksh
0 replies
20h6m

What did she pivot to? I don't think any currently existing job is really safe in the medium-to-long term.

Halong
0 replies
22h58m

My wife is paying our mortgage teaching English on Preply. I'm extremely worried about where we'll be in 10 years.

coffeebeqn
14 replies
1d1h

We can't be that far off from almost perfect real-time translation. There is some latency, of course, to hear and process.

mrob
11 replies
1d1h

Differences in verb-subject-object word order will always add latency. If you want to translate from German, with the verb at the end, to Welsh, where the verb goes at the start, you'll have to wait for the complete sentence before you can begin.

tralarpa
5 replies
1d

It's very impressive what simultaneous interpreters can do. They don't wait for the end of the sentence.

numpad0
3 replies
1d

Yeah they backtrack on branch prediction failures.

dylan604
2 replies
1d

What kind of heartbleed that must introduce.

Vecr
1 replies
23h34m

You mean meltdown/Spectre?

dylan604
0 replies
23h15m

probably, but you got the gist anyways

MrsPeaches
0 replies
1d

Even they struggle with jokes though.

This may be apocryphal, but I've heard that in formal settings (e.g. the UN) they won't translate it and will instead give instructions on when to laugh.

d3m0t3p
4 replies
1d

Not necessarily true. For the first few sentences you won't be able to do it, but afterwards, once the context is established, you don't really need to wait for the verb; you can predict it. For example, if you are speaking about cleaning the house and you detail that you have cleaned the kitchen, the stove and so on, you can predict the verb with only the start of the sentence. I don't have any source to back this up, but it sounds plausible.

gberger
1 replies
1d

What if the predicted verb was incorrect, but the model has already translated the incorrect prediction? How does it tell you about a mistake?

mrandish
0 replies
23h27m

A good approach might be to start with how top notch, ultra-experienced human translators handle corrections for real-time scenarios, for example, the expert translators that do the ear monitors at the United Nations. I've worked with a few such real-time translators when preparing keynote speeches and they seem to have rigorous processes that appeared quite deep. Probably a ton of domain expertise to be captured there.

That said, I suspect that real-time language translation is always going to be somewhat imperfect due to its nature. Non-real-time translation of literature is still a subjective art form even at the very high-end of human expertise.

shkkmo
0 replies
23h36m

Once you start predicting what someone is going to say, you are no longer translating their speech.

Teever
0 replies
19h58m

Yeah but then you're just introducing branch mispredictions which will cause latency and potential confusion down the line.

It's all a trade off.

Either way, it's extremely exciting that we get to even discuss this stuff as real possibilities.

Innervisio
1 replies
1d

Although true, and considering what "mrob" also replied, this will never mean full translation every time, all the time. It will work in specific environments and with specific linguistic expectations.

I’ve been learning german since 8 years, and the amount of expressions and different ways to say things around the country is impressive. There’ll be a “interpretative” real-time translation, but it won’t guarantee fully understanding in so many cases, maybe ever.

Another thing, and we have this in common with all languages, is context, and this is difficult to address, I believe.

Nevertheless, it's impressive how far we've come, and I acknowledge the usability of these tools. However, human knowledge will always be crucial and primordial if we want to guarantee full understanding.

InCityDreams
0 replies
1d

I’ve been learning german since 8 years,

"Since", as used here, would lead me to guess you are not a native English speaker?

fassssst
10 replies
1d

Try the demo here, you record a video of yourself and it does voice cloning and a comparison:

https://seamless.metademolab.com/expressive/?utm_source=meta...

ceejayoz
6 replies
1d

This research demo is not open to residents of, or those accessing the demo from, the States of Illinois or Texas.

Interesting mix.

aschla
2 replies
1d

Likely related to biometrics laws. I know Illinois has restrictions on the collection of biometrics, not sure about Texas. Facebook in particular paid out a significant amount of money in a class action in Illinois, I know because I got a chunk of change from it.

dylan604
1 replies
1d

By which you mean someone took a dime and carved off a piece of it, and then sent you a piece of paper with postage that cost more than the value of that chunk? Yeah, we all got hosed by that one too, I'd imagine.

ceejayoz
0 replies
1d

https://www.nbcchicago.com/news/local/illinois-facebook-user...

According to the Settlement Administrator, payments to class members between $200 to $400 started going in the mail May 9.

I got a $0.19 check from an iTunes settlement once, but this wasn't one of those cases.

solardev
1 replies
1d

Illinois has a facial recognition / cloud biometrics ban. Familiar face detection for doorbells etc. isn't allowed there. Wonder if Texas has something similar?

ceejayoz
0 replies
1d

Ah, that makes sense.

In Texas it seems to be part of AG Paxton's culture war stuff. https://www.texastribune.org/2022/05/12/texas-face-filters-i...

jlund-molfese
0 replies
1d

It’s because of https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004&C...

Facebook has had to pay out hundreds of millions of dollars in settlements for related class-action lawsuits, and rather than trying to get informed consent, they’re deciding not to collect biometrics from residents of those states.

wedn3sday
0 replies
23h48m

Well that was spectacularly bad. It failed to translate a single word from English to Spanish. Admittedly I was using George Carlin's favorites, but if you're trying to have an expressive language translator that refuses to translate "fuck", then what you've got is bullshit.

teacpde
0 replies
1d

As someone working in tech and following along with the progression of AI, I believe I have the right expectations. But it still feels surreal seeing myself speaking a foreign language in my own speech style.

SillyUsername
0 replies
1d

And that demo is now overloaded and fails to translate the input :D

jeffbee
5 replies
1d

How will Meta put these models into practice? I understand why Google and Apple have models for their mobile OS users, but I don't understand where users for Meta speech models come from. Are they planning to show Instagram videos with English narration in French or what?

polygamous_bat
1 replies
1d

Ads and Reels (their TikTok competitor) I imagine would be the primary use-case. Imagine spreading the "wonders" of TikTok-like videos to non-$native_language speaking world.

dylan604
0 replies
1d

but isn't that a TikTok shtick to use the obviously fake voice in your video?

spacemanspiff01
0 replies
23h36m

The metaverse will not have any language barriers...

solardev
0 replies
1d

Ads in any language!

crakenzak
0 replies
1d

They have arguably the most diverse userbase of any company, with users from pretty much every single country + language across all their services & apps. I could easily imagine a handful of use cases where having a high-performing universal translation model would be incredibly useful.

zengid
4 replies
23h57m

If "toxic word hallucinations" isn't a cyberpunk phrase I don't know what is.

(quote from the video presentation in the link)

spacephysics
0 replies
23h55m

Oh god they’re gonna censor the output. Time for musk to make a non-censored version lol…

drexlspivey
0 replies
22h10m

I am sorry Dave, "merde" is not in the pre-approved word list

dontupvoteme
0 replies
21h26m

I wonder if it doesn't understand the common colloquial usage of "geil" in German. This sounds like it is going to mess up natural language

albert_e
0 replies
17h35m

I thought he misspoke

I thought it was meant to be "toxic words, hallucinations, etc" in the script

ukuina
4 replies
1d1h

Next step is combining the output with few-sample speech synthesis so the output is in the original speaker's voice!

modeless
3 replies
1d

This does that already. At least, to a first approximation. Voice cloning is not that great in general right now.

coffeebeqn
1 replies
23h10m

Voice cloning works pretty well already, but not necessarily with one 10-second sample as the source data. If you can give it some hours of data, it'll work much better.

modeless
0 replies
20h52m

Do you have examples of it working well? I haven't heard anything that really impressed me. Nothing close to a good human impersonator. We're a long, long way from replacing voice actors, even considering the rapid rate of progress.

blovescoffee
0 replies
1d

The voice cloning worked pretty well for me. From English to Spanish, I noticed that the first few words sounded more like me than the last few. Also, it doesn't sound like how I speak in Spanish, but that's expected.

mkagenius
4 replies
21h44m

Yet again, Hindi (the major language in India) is not even in the samples. India is Facebook's largest user base (and probably a third of the engineers working there are Indian), but Facebook never puts enough effort into contributing back. It only uses the DAUs from India in investor calls.

cafed00d
3 replies
21h33m

By "samples" do you mean examples on the marketing/landing page? It sure looks like the model supports many major Indian languages like Telugu, Tamil & Kannada. https://huggingface.co/facebook/seamless-m4t-v2-large

Yeah, I kinda agree with the spirit of your comment; it sure would be nice to see a major Indian language like Telugu on their landing page for sure. But that's just my Indian-person bias speaking.

albert_e
1 replies
17h37m

Slightly new to this area: is there an example notebook that shows how we can use this model with our own sample audio and text? Thanks!

mortimerp9
0 replies
5h45m

I work on seamless and you can find sample code here: https://github.com/facebookresearch/seamless_communication or in the HuggingFace space.
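
If you'd rather stay in Python, a minimal sketch of speech-to-text translation via Hugging Face transformers (assuming a recent release with SeamlessM4Tv2 support; the file name and language code are placeholders) looks roughly like this:

    import torchaudio
    from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

    processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
    model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")

    # The model expects 16 kHz mono audio.
    waveform, sr = torchaudio.load("my_clip.wav")
    waveform = torchaudio.functional.resample(waveform.mean(dim=0), orig_freq=sr, new_freq=16_000)

    inputs = processor(audios=waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    tokens = model.generate(**inputs, tgt_lang="eng")
    print(processor.batch_decode(tokens, skip_special_tokens=True)[0])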

mkagenius
0 replies
21h24m

The lack of focus shows up in the results. The models never perform as well on Indian languages as they do on French or Spanish. This goes for Google, too.

ziptron
3 replies
1d

If you are multilingual but have young children and plan to continue residing in your current English-speaking country for the foreseeable future, are you opting to teach your children those additional languages, or are you adhering to the idea that they can always learn them later if necessary, considering it might not be essential (especially with models like this)?

esafak
2 replies
23h42m

It is easier to learn multiple languages when you are young.

robga
1 replies
23h29m

There isn't a lot of good evidence behind this popular conception.

If anything, the evidence is that it isn't true, see https://journals.plos.org/plosone/article?id=10.1371/journal...

Any apparent causality of age of acquisition seems to be a proxy for hours of exposure. It may well be that it is easier for young people to rack up a lot of exposure to a second language, but there's not much evidence that age plays much of a factor for people of different ages who had the same degree of exposure.

debugnik
0 replies
21h28m

we argue that the late learners resort to computationally less efficient processing strategies when confronted with (lexically determined) syntactic constructions different from the L1.

we show that the ERP signal in response to grammatical violations depends on the AoA of an L2 learner, as well as on the regularity of the structure under investigation. In (lexically determined) syntactic constructions different from the L1, we found a gradual change in processing strategies that varies by AoA, with a native-like effect for early learners and a less efficient neural processing strategy for later starters.

Although they do clarify that these effects could be confounded with age of acquisition instead of it being the cause.

yread
3 replies
22h54m

Does the Spanish expressive sample sound muffled for others too? And the French sounds super mechanical. Hopefully, it's more impressive the other way.

Also: "This research demo is not open to residents of, or those accessing the demo from, the States of Illinois or Texas"

grogenaut
0 replies
22h0m

Illinois is possibly because they don't allow storage of biometric data without express permission and I believe explicit usage restrictions. So I bet they're keeping all of your utterances, which would violate that law.

dentalperson
0 replies
22h37m

Yes, they all have significant 'ghosting' artifacts where the harmonics are a bit fuzzy if you listen closely. AFAIK all of the recent neural speech engines have this, from SoundStream to EnCodec, especially in low latency causal setups. Wavenet was a bit better in that regard but has fallen out of style due to complexity and the lack of a bottleneck. It seems like something diffusion post processing would be able to clean up.

TacticalCoder
0 replies
22h23m

The "expressive" example in french exhibits a thick accent which bothers me more than the mechanical aspect of the non-expressive french example.

It's not dissimilar to some kind of "ch'ti"/"chtimi" accent, or a Belgian French accent (which itself resembles the ch'ti accent heard in some parts of the north of France): "Ne partez pooooo" (with a longer "a" that sounds nearly like an 'o', which isn't proper French at all) instead of "Ne partez pas".

That said, I'll take the non-expressive voice any day over subtitles when watching video in a language I don't understand: it's clearly good enough.

wg0
3 replies
1d

And just the other day StyleTTS[0].

Just text to speech has gone too far. Audio books would be mainly generated on the fly like this?

I think some RPGs in some 5 years time might have something like this:

- A text file that outlines characters and a loose plot/storyline. Human written.

- 3D Mesh Generation based on character description via Transformers based models. Auto generated.

- Dialogues for each NPC via LLM.

- This TTS engine again based on such models.

Result - almost unlimited replayability. Or even edit the text file and have a new world based on a new storyline, with characters having different personas.

[0]. https://news.ycombinator.com/item?id=38335255

mpalmer
1 replies
23h8m

How has TTS gone too far?

wg0
0 replies
20h20m

Come a long way, that is. From the days of, if I recall correctly, the Windows 98 screen reader.

mortimerp9
0 replies
5h44m
infotainment
3 replies
1d1h

It’s amazing how far text to speech has come in the past few years, but what I’m wondering is when this tech will finally make it into local TTS engines baked into the OS (eg for screen readers, etc)

callalex
1 replies
21h43m

This is already built into recent iOS devices and it’s called Live Captions.

freedomben
0 replies
19h19m

Same with Android (Pixel phones at least).

I'm the most excited for an open source one though, and it would be incredible if this could become it. I do 95% of my compute on desktop linux and it sucks being behind.

PartiallyTyped
0 replies
1d

The accessibility nerd in me is excited!

whbrown
2 replies
1d

Can anyone help demystify the licensing?

Besides the ACCEPTABLE_USE_POLICY, there's a CC BY-NC 4.0 (NonCommercial) license, a 'SEAMLESS_LICENSE' (NonCommercial), but also an MIT license? It would seem these other licenses contradict the MIT license. Could somebody help clarify how these all interact in practice?

disattention
0 replies
1d

The license details are listed on the project GitHub

https://github.com/facebookresearch/seamless_communication#l...

dankle
0 replies
1d

MIT for the code, NonCommercial for the trained models I bet.

nathanfig
2 replies
22h24m

Impressive work, really excited for this.

I will note though that I feel safer getting an occasional bad word than I do having a translator straight up deceive me.

For example, "what the fuck" in English->Spanish is giving "qué diablos" output. Definitely toning down the meaning there.

If someone says something mean to me, I want to know it.

jonathanlb
1 replies
21h49m

This may be an intentional decision given that there are several ways to say "what the fuck" in Spanish, such as "qué mierda" or "qué carajos". And that's not including regional expressions like "qué coño" or "qué chingados". So, saying "qué diablos" may be the most common expression across dialects conveying the same meaning.

nathanfig
0 replies
21h43m

Yeah could be, I still need to read the paper to better understand the safety tuning.

Would be interesting to see some work stress-testing the ability to convey ill-intent across multiple languages. Accurately conveying ill-intent is safety-critical for the person being threatened.

anonzzzies
2 replies
1d

How far from a real-time Star Trek translator? Whisper is fast enough and light enough, LLMs are getting there, so it’s close isn’t it?

Sol-
1 replies
1d

Seems like there will always be latency, because it's not possible to easily stream over languages that have different structure. You need to wait a bit before you can start faithfully translating the meaning.

They also mention it in one of the videos about the streaming variant of their translator. But I guess the ~2s delay they mention is close enough for practical purposes.

I feel like for personal relationships where true real-time is required, having a computer intermediary would be weird anyway and you have to learn the language, at least for the time being and as long as personal relationships are still relevant (in the post-AI world they might not be).

forgot_old_user
0 replies
22h21m

You need to wait a bit before you can start faithfully translating the meaning

I guess it's possible that the AI learns about a specific person over time? That way it can be confident about what's being said as soon as the person starts saying it.

WhatsName
2 replies
1d1h

Did anyone compare this to NLLB (also Meta) yet?

trovas
0 replies
1d

In the paper, the reported results show a very similar level of quality.

jkw
0 replies
1d

We're the same team! We have some comparisons in the paper.

StrangeDoctor
2 replies
1d

Any more info about the watermarking? Only Meta can make the determination?

Edit: I can’t find the weights but if I’m reading the paper right anyone could train their own detector.

hadyelsahar
1 replies
22h44m

Hey! An RS from the Meta Seamless team here.

Yes, we chose not to release the watermark detector to safeguard against adversarial attacks. This decision helps prevent any attempts to erase the watermark by malicious users.

The watermark generator and detector are trained together. One can use the information in our paper to train your own generator and detector model; however, in this case the watermark signature created will be distinct from the one we use to protect our Seamless translation models. This approach ensures each model maintains its unique security features.
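
As a toy illustration of that joint-training idea (this is a conceptual sketch, not Meta's implementation): the generator learns to add a tiny perturbation to the audio, the detector learns to flag it, and both are optimized together so the mark stays detectable while remaining (nearly) imperceptible.

    import torch
    import torch.nn as nn

    class WatermarkGenerator(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
                                     nn.Conv1d(16, 1, 9, padding=4), nn.Tanh())

        def forward(self, audio):                  # audio: (batch, 1, samples)
            return audio + 1e-3 * self.net(audio)  # small additive mark

    class WatermarkDetector(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
                                     nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                     nn.Linear(16, 1))

        def forward(self, audio):
            return self.net(audio)                 # logit: marked vs. unmarked

    gen, det = WatermarkGenerator(), WatermarkDetector()
    opt = torch.optim.Adam(list(gen.parameters()) + list(det.parameters()), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    clean = torch.randn(8, 1, 16_000)              # stand-in for a real audio batch
    marked = gen(clean)
    loss = (bce(det(marked), torch.ones(8, 1))     # detector should fire on marked audio
            + bce(det(clean), torch.zeros(8, 1))   # ...and stay quiet on clean audio
            + 10.0 * torch.mean((marked - clean) ** 2))  # keep the mark imperceptible
    opt.zero_grad(); loss.backward(); opt.step()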

StrangeDoctor
0 replies
22h0m

Thanks for clarifying, and seems like a completely reasonable approach. Thanks for the great work.

trinovantes
1 replies
22h21m

Currently Steam bans games from using AI-generated assets (for good reason). I wonder if they'll backtrack on this or carve out exceptions, because this tech seems really useful for indie devs to add voice work to their otherwise silent games.

yjftsjthsd-h
0 replies
20h34m

Very speculative amateur opinion: My understanding is that Valve didn't exactly ban AI; they banned AI that was fed copyrighted works that could possibly make the results copyright infringement ( https://www.theverge.com/2023/7/1/23781339/valve-steam-ai-ar... ). (Side note: Regardless of individual views on whether AIs are just copyright regurgitators or not, I can understand Valve being cautious until courts have actually decided.) So if speech models can be made purely from assets that their creators can prove they have the rights to use, it would probably be easy enough to get it approved.

stephc_int13
1 replies
23h7m

As a French native speaker, I am surprised by the low quality (frankly ridiculous) voice of the French translation example.

Especially because the head of AI at Meta is a French guy AFAIK (Yann LeCun).

sangnoir
0 replies
22h2m

They are optimizing for speed (low latency)

nextworddev
1 replies
23h19m

RIP ElevenLabs?

Hakkin
0 replies
18h8m

I tried to do Japanese -> English for multiple audio snippets using the Seamless huggingface demo and all of them output complete gibberish. Really makes me question how many of the languages they claim to "support" are actually usable. ElevenLabs at least produces a result that resembles the input, so they still have the edge in some places.

mightytravels
1 replies
20h52m

I like how easy it is to get going, but you need to download about 20GB, and S2ST needs 40GB of GPU RAM!

It runs, but for any audio input I tried (you will need to provide wav, not mp3s; I tried 20s/40s/300s clips) I get just one short sentence back in the target language that seems not related at all to my audio input (e.g. "Tous les humains sont créés égaux").

Seems like some default text but it runs on full GPU for 10 minutes. Tons of bug reports in GitHub as well.

Text translation works, but I'm not sure what the context length of the model is. It seems short at first glance (I haven't looked into it).

Oh, and why is Whisper a dependency? It seems unneeded if FB has their own model.

mortimerp9
0 replies
5h36m

Hello, I work on seamless.

It runs, but for any audio input I tried (you will need to provide wav, not mp3s; I tried 20s/40s/300s clips) I get just one short sentence back in the target language that seems not related at all to my audio input (e.g. "Tous les humains sont créés égaux").

You might want to open an issue on GitHub for that one. The model is made to work on short utterances; if you have a long speech, you'll want to segment it first. I've tried "tous les humains sont créés égaux" on the demo: https://seamless.metademolab.com/expressive (which runs the same code as in the repo) and the output was correct. Maybe there is something wrong going on in the conversion of the input audio?
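
For reference, a rough sketch of that kind of preprocessing (not taken from the Seamless repo): downmix to mono, resample to 16 kHz, and split a long recording into short chunks before sending each one to the model. A real pipeline would split on silence (e.g. with a VAD) rather than at fixed offsets, and the file name below is a placeholder.

    import torchaudio

    CHUNK_SECONDS = 15
    TARGET_SR = 16_000

    waveform, sr = torchaudio.load("long_recording.wav")       # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)              # downmix to mono
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)

    chunk_len = CHUNK_SECONDS * TARGET_SR
    chunks = [waveform[:, i:i + chunk_len]
              for i in range(0, waveform.shape[1], chunk_len)]
    # Each chunk can now be translated individually.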

Oh, and why is Whisper a dependency? It seems unneeded if FB has their own model.

Whisper is a dependency as it's used as a baseline for evaluation. You can check out the paper for explanations.

kapp_in_life
1 replies
1d

Neat. How translatable are tones of voice for intent across languages? For example, does a person doing a "nerdy" voice (nasally, whiny, etc.) in English get translated to the "nerdy" stereotype for a French speaker? It seems to do very well on whispers, which made me wonder what could be next.

jeffbee
0 replies
1d

If you don't speak the language into which these models translate your inputs, how do you know if or why the model has generated, without being commanded to do so, a campy American gay male sociolect, or an African American regional accent, or some other thing that may convey unintended meaning to native listeners?

iFire
1 replies
22h54m

LICENSE

Attribution-NonCommercial 4.0 International

https://github.com/facebookresearch/seamless_communication/b...

iFire
0 replies
22h53m

Took me 2 minutes to find the Github.

gagabity
1 replies
1d

I had pretty terrible results when I tried English -> Swahili using the Hugging Face M4T V2 spaces; it pretty much doesn't work most of the time, and I just get English back with a different voice. Expressive, on the other hand, only has a few languages, it seems.

It would be nice if they could lay out what exactly is missing in terms of data to make a language work better; while the actual AI bit is out of reach for most of us, maybe we could provide more data.

There is also a 60-second limit, and I wonder if this is a Hugging Face limitation or Seamless's.

yorwba
0 replies
1d

maybe we could provide more data.

If you want to contribute by recording yourself speaking Swahili, https://commonvoice.mozilla.org/sw is the place to go. Although Meta has access to much larger data sets, they nonetheless use Common Voice as a "known good" source. E.g. the paper on their SONAR speech encoder reports experiments on Common Voice data, coincidentally involving Swahili https://ai.meta.com/research/publications/sonar-sentence-lev...

apwell23
1 replies
1d

.

jvolkman
0 replies
1d

The Google Translate app has a conversation mode.

Jayakumark
1 replies
1d1h

How does this compare to whisper-large-v3 on STT?

trovas
0 replies
1d

I work on seamless. You can see the results in the paper. M4Tv2 is significantly ahead (Whisper Large v3: 16.9 BLEU vs. M4Tv2: 26.6). These are averages over 81 directions, X->English.

xnx
0 replies
21h34m

This tech from Google seems similar, but doesn't have a fancy demo: https://blog.research.google/2023/12/unsupervised-speech-to-...

troseph
0 replies
23h47m

I feel like naming something "seamless" is not dissimilar to calling the Titanic unsinkable.

tambourine_man
0 replies
23h31m

Every video on this page is a bit out of sync with the audio. Combined with the blandness of the facial expressions and the whole mood in general, I kept waiting for the moment when the video would disclose that everything in it was created by AI.

sargun
0 replies
6h36m

It’s funny, all the humanities types try to push the proliferation of languages, but the engineering types keep trying to reduce the language barrier.

rammer
0 replies
19h53m

Marketing has been heavily involved in this page...there's at least one coloured person for every white photo..

quickthrower2
0 replies
20h3m

How did that page get camera access without my permission?

Edit: by the upvote I guess it wasn't just me?

polygamous_bat
0 replies
1d

    "The Babel fish is small, yellow, leech-like, and probably the oddest thing in the Universe. It feeds on brainwave energy received not from its own carrier, but from those around it. It absorbs all unconscious mental frequencies from this brainwave energy to nourish itself with. It then excretes into the mind of its carrier a telepathic matrix formed by combining the conscious thought frequencies with nerve signals picked up from the speech centres of the brain which has supplied them. The practical upshot of all this is that if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language. The speech patterns you actually hear decode the brainwave matrix which has been fed into your mind by your Babel fish.
    "Now it is such a bizarrely improbable coincidence that something so mind-bogglingly useful could have evolved purely by chance that some thinkers have chosen to see it as a final and clinching proof of the non-existence of God.

    "The argument goes something like this: 'I refuse to prove that I exist,' says God, 'for proof denies faith, and without faith, I am nothing.' 'But, says Man, the Babel fish is a dead giveaway, isn't it? It could not have evolved by chance. It proves you exist, and, by your own arguments, you don't. QED.' 'Oh dear,' says God, 'I hadn't thought of that,' and vanishes in a puff of logic."

pnut
0 replies
1d

I was hoping to find out that the actor's voice in the demo video was generated, or that he had recorded the video speaking in another language or something.

That would have been the knockout punch.

novok
0 replies
20h40m

I wonder how well this will perform for automatic comics translation. Current local models are pretty bad.

m3kw9
0 replies
13h44m

make a .llamafile and we'll use it.

kaycebasques
0 replies
1d1h

Besides the obvious good news about making it easier for people to communicate with each other across languages, it's also exciting to me that we're trending towards a world where I can tap into all the knowledge that only exists on the non-English web. I'm sure there are vast troves of programming knowledge in the Japanese-only web for example. The Chinese-only and Russian-only web are obvious candidates too but presumably those are harder to access for other reasons.

jwineinger
0 replies
21h7m

Any ideas on what kind of hardware this would require to run S2ST?

gorbypark
0 replies
21h39m

I've been trying (and mostly failing) to set up a pipeline to get system audio into Whisper and feed that transcription into a Seamless M4T text-to-text translation model. It seems like Seamless Streaming is going to solve most of my issues, and should significantly reduce latency!

My ultimate goal is to have realtime translations of video conferences. I've moved to a new country, and while I'm super privileged that most of my colleagues speak English, we still have a number of "all hands" meetings that I get lost in pretty easily.
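
For what it's worth, a minimal sketch of that pipeline (assuming openai-whisper and a transformers build with SeamlessM4Tv2 support are installed; the file name and language codes are placeholders):

    import whisper
    from transformers import AutoProcessor, SeamlessM4Tv2ForTextToText

    asr = whisper.load_model("medium")
    processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
    translator = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")

    # 1. Transcribe the captured system audio with Whisper.
    transcript = asr.transcribe("meeting.wav")["text"]

    # 2. Translate the transcript text-to-text with Seamless M4T v2.
    inputs = processor(text=transcript, src_lang="nld", return_tensors="pt")
    tokens = translator.generate(**inputs, tgt_lang="eng")
    print(processor.batch_decode(tokens, skip_special_tokens=True)[0])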

gloyoyo
0 replies
21h2m

This is so world changing! Exactly how I wanted to speak so confidently!

Thank you Meta!

gagabity
0 replies
21h47m

Can this do speech-to-text English -> English? I get strange results if I do a translation to the same language. It would be an interesting alternative to Whisper if it could.

denton-scratch
0 replies
55m

I don't see how realtime voice translation can ever be possible; to properly translate the first half of my sentence, you need to hear the whole sentence first. I don't know how simultaneous translators can translate from verb-at-the-end languages like German, until they know the verb.

It's not just where the verb is; sometimes I say something ambiguous, and my next utterance is supposed to acknowledge and remedy that. But if that ambiguity doesn't exist in the target language, I don't see how a simultaneous translator can convey the ambiguity, without knowing how the next utterance is going to refer to it.

Maybe that's why human simultaneous translators often seem to stumble or backtrack. I've never met someone whose job was simultaneous translation. It must be very difficult.

I'm impressed by this effort to convey non-linguistic elements of speech in translation. It's quite an achievement, and a very ambitious goal.

Aside: I wish I knew how speakers of tonal Chinese dialects express feeling, when tonality is supposed to convey semantics. When I hear chinese speakers, I can "hear" the feeling, but I don't know how they do it - it can't just be down to emphasis. (I learned some mandarin 50 years ago, at school. I learned the tones, but they didn't teach expression; and I was never taught by a native speaker, although there were language-lab tapes.)

btbuildem
0 replies
21h59m

The near-realtime aspect of this is so promising -- we're getting closer and closer to IRL babelfish!

What I would love to see is an ability to add my own voice (yes, at the risk of deepfakes) so that the model could "speak" in any language and sound more like me, not some random voice actor it was trained on.

bsza
0 replies
23h42m

"We need access to your microphone and camera to record your voice and translate it with your expressions."

None of the videos shows any modified/lip-synced footage. There doesn't seem to be a reason for this thing to need access to my camera.

Also, using it with tape over the camera doesn't seem to work either. (Perhaps it needs to see facial expressions in order to work?)

bozhark
0 replies
18h43m

I want this as a channel in our discord.

It would allow more interaction between people who don't speak the same language.

beders
0 replies
1d

I'm thrilled to see the progress made in the last 30 years.

As a student in the mid-90s I worked on a system called Verbmobil at the German Research Center for AI; it did speech-to-speech for English, German and Japanese in a very limited domain.

This was done via "classical" NLP: You had to model the domain with concepts, you needed sentence parsers, semantic engines, speech-to-text hand-crafted for 3 languages etc.

As it turns out, this approach is/was a dead-end.

asylteltine
0 replies
18h44m

It really sucks that a company so irresponsible with all your data is one of the leading AI companies now.

TheCaptain4815
0 replies
1d

The demo is so much fun to use. I can't wait for all these technologies to start integrating into filmmaking / games.

Reubend
0 replies
23h13m

Wow, after trying out the demo, I'm floored by how high quality this is. The translations worked perfectly, the voice cloning was "good enough", and the emotions conveyed in my voice were retained pretty accurately.

I don't think this would fool anyone that I was a real native speaker of the target language, but for casual conversation this would work pretty much perfectly. It basically avoids all of the traditional pitfalls of machine translation, like the unnatural robotic voice that it outputs, the slow translation speed and huge latency for realtime conversation, and the loss of emotion.

MagicMoonlight
0 replies
20h39m

> Automatically filters out toxic speech
> Watermarking

So it can't be trusted at all then

I_am_tiberius
0 replies
21h44m

I hope all these AI products will have privacy-focused alternatives more quickly than happened with Web 2.0.

Havoc
0 replies
23h39m

Can this also do straight TTS, or is it translation only? It isn't quite clear to me from the site.