The Seamless Communication models

ChuckMcM
60 replies
22h16m

I look forward to the day when I'm wearing my headphones in a foreign land and hearing all of the discussions in my own language.

The "universal translator" which was part of Star Trek and a lot of other Sci-Fi I was exposed to as a kid was something I was really fascinated with. My Dad worked as a simultaneous French->English translator and sadly spent long hours away from home and, as a kid, I started trying to build a translator so that it could do his work and he could be home more.

Translation is important work that could help a lot of people. It's my hope that we get to the point where these models run entirely on locally carried resources.

rangestransform
14 replies
21h45m

how am i supposed to talk shit with my friends about other people in public then

flanbiscuit
8 replies
21h28m

I'm curious to know how well these models can pick up slang. Maybe if you talk shit in as thick a slang as you can it won't be able to give a good enough translation.

kredd
4 replies
21h11m

With my bi/trilingual friends who speak the same languages, we intermix them to make our point more clear. Don’t think models will be good enough for mixes for a few more years, so we’re safe!

smcin
3 replies
21h5m

Can you show us an example of such a sentence?

kredd
2 replies
19h55m

Hm, think of things like “On va bruncher” (we’re going to brunch). The word “brunch” doesn’t exist in French, but we add suffixes to make it fit into the sentence. Very common in Montreal. My French isn’t good enough to do that on the fly, but my francophone friends do it all the time.

In my other languages, the ones I am actually fluent in, it’s kinda the same — you use specific suffixes to soften or embolden your point and so on. Maybe throw in exclamation sounds from a specific language too. Eventually your nouns and verbs end up in different languages, with different suffixes where it “makes sense”, yet the person you’re talking to will “get it”.

Would be curious to try the new Seamless model on speech like that.

dopidopHN
0 replies
17h39m

I would think this model would fail with heavy Québécois lingo, as opposed to standard French.

bertil
0 replies
19h40m

This is extremely common for every new technology: “upload,” “download,” “stream,” “google,” “FaceTime,” most code patterns, all the new ML apps, “venmo” or whatever the name of the app you use for payment, etc. All of those are taken as is, get a verb ending slapped on, and it’s good enough. That’s true in German, Danish, Dutch, French, Italian, and Spanish.

The only thing that doesn’t work is if you talk to people too young to remember Skype. Then you feel old.

fasquoika
0 replies
18h31m

Reinventing Polari is certainly one way to make yourself less understood...

dopidopHN
0 replies
17h41m

Cockney English and French Verlan come to mind.

I don’t know about Cockney, but Verlan is very much alive.

dontupvoteme
0 replies
21h8m

I'd love to see a map of how it matches up to regional English/British accents and their slang.

ugh123
1 replies
21h22m

learn Klingon?

bertil
0 replies
19h40m

Klingon is definitely going to be in the top 50 languages covered…

csa
1 replies
20h59m

Speak in metaphor and/or code.

I’ve been in mixed language communities in which I wasn’t sure who spoke what, and I have found this to be quite effective when done right.

Good time to reference the ST:TNG “Darmok” episode and quotes like “Darmok and Jalad at Tanagra”.

ChuckMcM
0 replies
17h58m

Cincinnati when the Turkeys fell.

buryat
0 replies
18h49m

sacvnsune
12 replies
22h12m

If I'm not wrong, Google Pixel Buds offer a live translate feature.

echelon
11 replies
21h58m

Not in the voice of the original speaker.

stevenicr
9 replies
21h29m

now if I could just get the pixel buds tech to remove the voice of the original speaker and translate some youtube videos from thick accent english into no accent am-english.

keerthiko
7 replies
20h16m

Obligatory, not directed at you in particular since I'm sure you mean no offense, but just voicing a pet peeve:

I grew up bilingual outside the US, and speak English with a hybrid British/Indian/Middle Eastern accent (with some of my personal quirks, and mixing increasing amounts of various American accents over time). I can understand English in nearly any accent (Singaporean, Chinese, Vietnamese, Indian, Nigerian, eastern European) as long as the words involved are globally used and the grammar is passably queen's. Especially after hearing it for about an hour. And people who natively speak English with these various accents usually can understand my English better than they can an average American accent. Yet in this country, my accent is belittled, despite being perfectly understood and more versatile. Even by others who don't speak with the American accent!

This is the problem of the "default accent" anywhere being referred to as "no accent", so that anything deviating is considered "having an accent". This makes "accent" a negative trait, scaling from 0-bad to heavy-bad. But if the vernacular were such that we said "American accent" instead of "no accent", then no one's accent is bad, just unfamiliar.

Most of my non-American peers who were raised on English have a better command of the language than my American ones, yet they are mocked for their accents as if they don't know the language, when in reality it's the Americans' lack of familiarity with the language (as it's used globally) that prevents them from comprehending it.

So yes, put in more work, the world is shrinking and English is the global language (for better or worse). What you're saying is spoken from a position of privilege because the culture allows you to mock others' accents and imply your version of it is the correct one that everyone else should put in work to provide you with, rather than the other way around.

Every time you hear English with an accent other than British, American or Australian, remember that it usually means the speaker knows at least one entire other language as well, probably one that you would sound like an idiot if you tried to speak it. Don't be rude or dismissive of their command of English.

In fact, you were so close — you called it a "no accent am-english", when you could have just called it what it is — "an american accent".

stevenicr
2 replies
18h46m

I appreciate your sharing, and stating that you assume I meant no offense, and that your thoughts are not directed at me specifically.

I could have been more specific, but my request was for the tech to vary; I think it would lead to specific options for different people.

And actually to be even more.. not sure the word.. I want 'the Chicago accent' I think it's called, or Midwest / no accent. Personally, as much as I enjoy some entertainment from Jersey / NY accents, I would not volunteer to watch tutorials on tech taught by the Sopranos cast - as funny as that might be (and I get that if you are from the NE, you may be learning just fine being taught with such a language style).

As annoying as some of the Cali style of language is, I can understand the words and meanings without squinting my ears and spending double the brain cycles trying to understand the words, then interpreting the meaning, and then trying to put together concepts for understanding new ways of coding or using tech.

I've run into folks in Louisiana that I could not understand at all and had to ask for an interpreter at a gas station. From Florida to Chicago to Seattle down to Miss and Ala - I can hear what people are saying and learn without spending lots of extra energy trying to understand.

With that being said, I understand there are parts around Miami where accents may be thicker (or not) - and with some folks, even if they are using the right words and grammar, I may need to slow down the speech to actually learn if they were teaching a class.

The slow down and speed up options already exist with youtube.

"So yes, put in more work"

- I do try a bit. I don't mind accents with some folks and media. For example, I can listen to and enjoy Shankar sharing via the 'hidden brain' series, partially because his accent is limited but also because the media requires less thought intensity.

I have tried many youtubes, and bought a few courses taught from folks in India and other places where I just could not muster the energy. I literally squint with my ears and feel like my head gets hot trying to decipher what is being said, translate into what is meant, and how it should create new patterns of understanding in my brain.

I can only do that for so long and I am done. Now I just skip any learning video that has non-am English speakers. When I consider courses to sign up for or buy, I have to research the authors / speakers and find video of them to hear the audio, because I just can't learn well that way.

"other than British," - True story, a few years ago I had to call an ISP in Britain(?) and the person I got to to file an issue with, I could not understand them. I had ask 'what did you just say' many times. I laughed at myself for even thinking of saying 'can you slow down and speak clearer English please' - I mean, crazy... I was paying by the minute for the long distance at the time and it ended up being a 25 minute call that could of been 10 if I had a magic translate without accent device.

"a position of privilege because the culture allows you to mock others' accents"

- This is truly not about mocking accents, this is truly about my lack of ability to learn well.

Yes, I would definitely sound like an idiot trying to speak another language. Like I said, I do not learn as well as some others.

Truly not my intent to be rude. I apologize if the shortness came off that way, I was trying to be brief in the hope that there's a chance that some tech like this exists and someone here could point me to it. Before I posted, I DDG'ed it and found a couple of things attempting to be in that space with a 'speak to sales' type of 'you'll never afford this' button for info.

I will never be dismissive of anyone's command of English, or other spoken language, or computer language or anything like that. There is no way for me to know someone else's situation and circumstances led them to their current command of whatever language. If someone is trying to learn more at any age; I applaud and encourage them - being rude or dismissive does not encourage more learning.

"no accent am-english", when you could have just called it what it is — "an american accent". - Well maybe, but actually I meant to be more specific, as mentioned a bit above - I mean '"no accent" American accent' - because there are plenty 'American accent' types that I would want removed by a magic earpiece to make it easier for me to understand and learn.

keerthiko
1 replies
18h23m

I appreciate the thoughtful reply. I don't think you're rude, and I get what you're saying as someone who thinks a lot about accents and languages. However, I still think you missed my point.

There is no "no accent". An accent is a baseline feature of intelligible human speech, like a voice, or a volume, or a language. You can't say stuff without those features. When you say "the Chicago accent", or the "Midwest accent", that's an accent! Not "no accent".

I understand it's common usage to refer to the default "radio accent" as "no accent", but in a country like America, all kinds of people with all kinds of accents speak English. Reinforcing an expectation that a certain (usu. majority-white-spoken) one is the "default" by referring to it as "no accent", implicitly suggests all others are erroneous affectations, even if I trust that is not your personal intent.

All that said, I think your idea for a translation device capable of revocalizing what is said with an unfamiliar accent into one you are used to is not a bad one, and likely easier than translating between languages while retaining expressiveness.

kortilla
0 replies
14h40m

reinforcing an expectation that a certain (usu. majority-white-spoken)

Wow, you just keep digging in don’t you? When these Americans you deride say “no accent”, do you think they are referring to the “majority-white-spoken” Scottish accent?

No, of course not. Get that race baiting out of here.

zer00eyz
0 replies
17h3m

https://www.bbc.com/culture/article/20180207-how-americans-p...

What accent? Whose accent? Brits are as diverse accent-wise as Americans: London, Cockney, New England, Southern...

A lot of Indians that I know have a very "proper" British accent, one that is maybe a bit aristocratic; it's quite an irony for a former colony. https://www.bbc.com/future/article/20220915-what-the-queens-...

The context matters, but so does history.

lstamour
0 replies
17h38m

There is another way of looking at this, in the context of the parent post: we could suggest that any accent could be converted to “no accent” where American accents are converted to British, or where standard Japanese is converted to a Nagoya pronunciation. Whatever seems like your preference of “no accent”. With this interpretation of the parent post, it’s not specifically about any particular English accent. I’ve been told by others that I have an accent yet I think I don’t have one - and honestly, I think most people have either encountered this - having an accent when you think you don’t have one - or haven’t travelled enough! :)

And I mean, yes, there are people who know they don’t sound like whatever ideal accent they have in mind, and there are people who will make fun of accents - but, and I can’t stress this enough, depending on the context literally any accent can be made fun of, sadly. I’ve had people mock my “American” accent while travelling, for example. It sucks, but it’s not easy to single out any accent as “default” unless it’s literally enforced by a government and taught that way in schools. Last I checked, the US is not one of those countries and English is not as centrally controlled as e.g. French can be.

kortilla
0 replies
14h51m

This would carry some weight if you didn’t take an opportunity to take a shit on Americans’ English in the middle.

Dylan16807
0 replies
16h20m

In fact, you were so close — you called it a "no accent am-english", when you could have just called it what it is — "an american accent".

There are many american accents. Your suggestion makes the sentence much less clear.

And by specifying "american" they're already making it clear there is no such thing as a universal base accent for english.

ChuckMcM
0 replies
21h19m

This is a really interesting use case. I could definitely see this as a service for content providers to get more reach and I think you could justify a subscription price for the service based on this.

By creating and keeping speaker-specific tonal ranges and profiles, you maintain better cohesion in the final product.

scotty79
0 replies
18h7m

It would be really cool as an assistance in practicing correct pronunciation and accent. Hearing your voice saying it right and then hearing how you actually said it the last time you tried might help you to get both into alignment.

mbforbes
12 replies
16h8m

I worked on building exactly this earlier this year. I was hanging out in Taiwan for a few months and thought, surely the Babel Fish should exist by now.

I did several experiments recording from all the microphones I could on my iPhone and AirPods while out in the wild. My conclusion: it's impossible right now for that hardware given the microphones we have and what they pick up.

So much of what's spoken is at a combination of (a) high distance (b) low volume (c) background obscuration. Something that was clear as day to my ears would barely register on the mics. While context is of course an issue, the raw audio didn't have enough to even translate.

The one caveat is that there might be low-level (i.e., Apple-only) access to headphone microphones that capture the environment to do noise cancellation. I'm not sure though---I couldn't find them on any API.

For cases where you do have clear audio, existing apps (e.g., Google Translate) are so close to achieving this, but don't let you specify audio outputs with enough fine grained control. By default, it will start screaming out of your phone what you were attempting to silently translate.

godelski
6 replies
14h9m

There's also some magic to the Universal Translator and Babel Fish: they perform zero-shot real time translation.

That is, they are able to translate (in all directions) novel languages that were not previously heard[0]. It is an open question, with a likely negative answer, whether there is a universal grammar even among humans[1] (the definition itself is vague, but even the most abstract version is suspect and highly unlikely to be universal across species). I think no one will be surprised if it remains impossible to interpret an entire language based on only a few words (let alone do it in real time).

This isn't a knock, because even a trained device is insanely useful; it's just a note about limitations and triage. This is awesome stuff and I can't wait for the day we have translation headphones. It's an incredibly complex problem that I'm sure is not short of surprises.

[0] There are a few exceptions such as Star Trek TNG's episode Darmok, S5E2, where the Tamarians' language is unable to be translated due to its reliance on cultural references (the literal words are translated but the semantic meanings are not). It's a well known episode and if you hear anyone saying "Shaka, when the walls fell" (translates to "Failure") they are referencing this episode (often not using the language accurately but who cares (nerds. The answer is nerds)).

[1] https://en.wikipedia.org/wiki/Universal_grammar

geoelectric
5 replies
11h56m

Can’t speak for ST, but did they ever say the babel fish understood languages it never heard before? I thought the galaxy was just exceptionally well-cataloged, given the HHG itself, and humans were hardly unknown.

civilitty
4 replies
11h52m

The babel fish translated via brainwave energy and a telepathic matrix:

> The Babel fish is small, yellow and leech-like, and probably the oddest thing in the Universe. It feeds on brainwave energy received not from its own carrier but from those around it. It absorbs all unconscious mental frequencies from this brainwave energy to nourish itself with. It then excretes into the mind of its carrier a telepathic matrix formed by combining the conscious thought frequencies with the nerve signals picked up from the speech centres of the brain which has supplied them. The practical upshot of all this is that if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language. The speech patterns you actually hear decode the brainwave matrix which has been fed into your mind by your Babel fish.

jasomill
1 replies
11h6m

“Now it is such a bizarrely improbable coincidence that anything so mind-bogglingly useful could have evolved purely by chance that some thinkers have chosen to see it as a final and clinching proof of the nonexistence of God.

“The argument goes something like this: ‘I refuse to prove that I exist,’ says God, ‘for proof denies faith, and without faith I am nothing.’

“‘But,’ says Man, ‘the Babel fish is a dead giveaway, isn’t it? It could not have evolved by chance. It proves you exist, and so therefore, by your own arguments, you don’t. QED.’

“‘Oh dear,’ says God, ‘I hadn’t thought of that,’ and promptly vanishes in a puff of logic.

“‘Oh, that was easy,’ says Man, and for an encore goes on to prove that black is white and gets himself killed on the next zebra crossing.

“Most leading theologians claim that this argument is a load of dingo’s kidneys, but that didn’t stop Oolon Colluphid making a small fortune when he used it as the central theme of his best-selling book, Well That about Wraps It Up for God.

“Meanwhile, the poor Babel fish, by effectively removing all barriers to communication between different races and cultures, has caused more and bloodier wars than anything else in the history of creation.”

blooalien
0 replies
7h44m

I couldn't help but hear this in my mind as it was read in the voice of the narrator from the old BBC "Hitchhiker's Guide" mini-series.

passion__desire
0 replies
1h33m

I think the idea of the Babel Fish might encroach on computational complexity limits in some sense. Imagine a future "Theory of Everything" book written in an alien language. The book has a total of 1 million characters across its pages, where each character is distinct. Now the Babel Fish must be able to "translate" such a language to English given its oracle-like powers? Can it do the job?

geoelectric
0 replies
10h46m

Well, then. Magic indeed!

KPGv2
3 replies
13h23m

Also a lot of spoken language involves context that AI is nowhere near understanding yet, let alone all the cultural baggage necessary to accurately translate/localize a lot of utterances.

"Can you stand up?" would be translated differently into Japanese depending on whether you're implying you need them to move their butt off your cell phone versus directly inquiring as to the function of their legs after a car accident. If you speak English and hear it as a background without the rest of the context being picked up, your brain instinctively knows it can interpret it either way, no problem.

But if you're Japanese and the AI picks a specific way to translate it, then you are completely unaware of the ambiguity because the AI resolved it with a 50% chance of being wrong.

pxoe
2 replies
8h25m

"Can you stand up?" would be translated differently into Japanese depending on whether you're implying

nitpicky, but is it though? not really. and it's as much 'difference depending on what you're implying' as there would be in english comparing just saying 'can you stand up' or specifying 'from the seat/at all'.

resonious
1 replies
7h29m

Probably not the strongest example but there are definitely phrases that are specific in one language but ambiguous in another.

youngNed
0 replies
6h9m

There are certainly nuances, even when 'understood'

Google: "A bit sticky, things are pretty sticky down there."

civilitty
0 replies
11h53m

I'm on mobile so can't find the link but years ago there was a DARPA (iirc) program trying to solve this problem in the context of surveillance in a loud crowded room. Their conclusion was that there needed to be n+1 microphones in the room to be able to cleanly differentiate all of the noise, where n is the number of noise sources, which in their case was number of conversations going on in the room (assuming no other loud sources of noise like music).

I think it's totally doable but you'd need many more microphones in order to deal with real world noise. As MEMS microphone quality improves, this should eventually be possible with a combination of smartphone/headphone/some other device like something around your neck.

diob
8 replies
21h24m

The problem is you need a full sentence, plus surrounding sentences to properly translate a lot of things (aka context matters).

So no matter what, hearing the conversation in your own language would involve some delay for translation.

sexy_seedbox
2 replies
18h16m

So then we need something like neuralink to get the whole thought from one's brain first, then the sentences are processed properly for the context, then translated before the speech is delivered.

kabouseng
1 replies
13h0m

Most thoughts are in a language. There is no one underlying universal machine language for the brain.

plastic3169
0 replies
10h37m

Are most thoughts in language? This doesn’t reflect my experience. Language floats on top, but there are layers under there. You can also feel it when you end up thinking in another language. It does not go through the first one but is a thing of its own.

Pretty sure there is nothing universal there though as you say.

ItsMattyG
1 replies
19h53m

My understanding is that they trained a separate model to specifically estimate when they have enough context to begin translating, as a skilled translator would.

scherlock
0 replies
17h19m

My mom used to do English/French translation. Her favorite example was the word "file". That word has multiple translations in French depending on the context, and that context may simply be implied by who is speaking. You may not be able to figure it out based on the conversation alone.

DigiDigiorno
1 replies
19h3m

Even the native original version needs the proper context. Sometimes you need the entire sentence to figure out what the sentence was really about.

I'm reminded of Mark Twain complaining about verbs arriving at the very end of sentences in German (among a myriad of other complaints).

"The Awful German Language" - Mark Twain https://faculty.georgetown.edu/jod/texts/twain.german.html

scotty79
0 replies
18h14m

Sometimes you even need a second sentence or even a few to understand what the first sentence was about.

ChuckMcM
0 replies
21h6m

I think I could adapt to that. But it would be an interesting experiment.

dimitrios1
4 replies
21h52m

Another lesson we can learn from Sci-Fi is very often different species on a planet would have their tribal / local languages and dialects but all spoke a common tongue. I think this is the more humanizing approach, rather than delegate even more of our fleshly processing power to machines.

somewhereoutth
2 replies
21h32m

This seems to be what is happening in Europe (and perhaps more generally across the globe), with English being the common tongue.

Question is, what will happen to the tribal / local languages? Will they survive?

nemomarx
0 replies
17h49m

Historically, we've seen the larger languages build themselves up by intentionally stamping out the teaching / use of smaller local languages. France banned some regional languages from appearing on broadcast television for years, etc.

This might be required to get full buy-in for a unified language, which is a bit sad but makes some sense - if you ensure it's taking up more and more of media and culture, more people know it from immersion, and other languages are reduced to being spoken at home / with friends, which cuts into how many people are really fluent in them.

Cthulhu_
0 replies
19h43m

It varies. A lot of local languages have gone extinct already. There are linguists hard at work trying to document / record dying languages, but it won't be the same as living the language from childhood.

micromacrofoot
0 replies
20h41m

then of course, there's always Darmok and Jalad at Tanagra

baby
2 replies
21h39m

I’m wearing the Ray-Ban Meta right now and they are already mind-blowing; I can already talk to the Meta AI assistant seamlessly. I bet one of the future iterations will have exactly this.

figers
1 replies
20h24m

Curious, what do you ask it besides take a picture / video or what's the weather?

I have a pair and have only asked it that so far...

baby
0 replies
16h51m

Whenever I have a question that I used to pull up Bard/ChatGPT for, and I’m wearing my glasses.

Kind of like having an expert next to you all the time.

pokstad
0 replies
15h39m

I look forward to the day when that problem is solved by a company that doesn’t mine my data to sell ads.

TheHumanist
0 replies
22h5m

Babel Fish

999900000999
52 replies
1d

Can't wait for someone to roll a language tutor out with this tech.

Everyone gets a personal tutor for hours a day.

I would absolutely love a VR game where I just need to work in China or Mexico all day and pick up the language that way.

jahewson
11 replies
1d

Isn’t having the AI do it for you better than having the AI teach humans to do it?

dylan604
5 replies
1d

Sure, if you're not into personal growth. Not everyone wants to become the useless bit of lard sitting in a chair while a computer does everything for them. Yet. Some of us still like to do the actual things, but just need some assistance along the way. We still have a bit of time before we're all the humanoids from Wall-E

ericmcer
2 replies
22h37m

Yeah, that's why I mill my own grain and am getting into textiles.

djvdq
1 replies
21h39m

I love it when people use these pathetic extreme examples when they don't have any meaningful arguments.

ericmcer
0 replies
19h5m

That isn't an extreme example at all; people used to mill grain and make clothing by hand, and now we don't. We somehow are not sitting around getting fat even though technology takes care of those tasks.

The parent's suggestion is that if we don't have to learn languages, that will lead to us all lying down drinking Big Gulps while robot slaves take care of us. Their take is the extreme example. People have literally made this same suggestion about every technological advance and it never comes true.

TeMPOraL
1 replies
19h43m

We still have a bit of time before we're all the humanoids from Wall-E

Obligatory reminder that the movie itself explains that people are what they are not because of their lifestyle, but because of the time spent in low-gravity environment.

dylan604
0 replies
19h15m

not sure that really matters to the point

whoisburbansky
0 replies
1d

It depends on what your goal is; for some tasks it's possible that getting the AI to do it is best, but, e.g. the existence of auto-pilot doesn't mean that hobbyist pilots wouldn't benefit from/enjoy exercising the same skills manually.

swatcoder
0 replies
1d

Maybe prior to fluency, for something like an odd business or tourist trip.

But there's a point in language learning where you can come to express yourself directly in a new language without intermediary "thinking" in your first tongue. The communicative and expressive potential of that mode is much higher than trying to squeeze one's intent through any kind of translation, machine or internal.

Plus, you know, it's fun.

modeless
0 replies
1d

Even a perfect human translator following you around wouldn't be anywhere near as good as knowing the language yourself.

j33zusjuice
0 replies
1d

Not necessarily. It depends on the use case. For taking a vacation, having an AI that can instantly translate to your native language would be amazing. That’d solve a lot of real world problems, no doubt.

However, translation has a great deal of subjectivity embedded in it, particularly when there aren’t 1:1 translations. Case-in-point: there are many English translations of the Christian bible, all similar enough, but there are enormous variations in some cases. And there are at least as many branches of Christianity as there are English translations of the Bible. Some of them strictly recommend the same translation, and they still disagree on the meaning of various passages.

Besides the problems inherent to translation, learning another language gives you another paradigm of thinking. The words we use, the way we construct sentences, etc., all impact our view of the world. Here’s a paper that discusses the impact of the over-reliance on English in cognitive sciences, and how this has downstream effects: https://www.sciencedirect.com/science/article/pii/S136466132...

Learning languages as an adult also has protective benefits. It reduces the probability of Alzheimer’s (maybe dementia, overall?).

coldtea
0 replies
1d

In the way that watching porn is better than having sex.

modeless
9 replies
1d

This is what I'd like to build (the tutor part at least, not the VR game part yet). I'm planning to extend my current English only rough prototype[1] to support Mandarin. (I happen to be learning Mandarin myself at the moment, and there are a bunch of open source bilingual Mandarin LLMs and speech synthesizers from China to choose from.)

I think a lot of people are working on similar things right now. I know of one called http://yourteacher.ai

[1] https://apps.microsoft.com/detail/9NC624PBFGB7

siraben
8 replies
1d

Is there a high quality speech synthesizer (ideally local) for Mandarin you have found? There are some subtleties with tone sandhi rules and how they interact with prosody that I feel are lacking with current TTS voices I’ve tried.

gattr
3 replies
23h59m

I love the idea of LLMs being super-efficient language tutors. And you have a good point; coming soon: "We've been getting a lot of these tourists here lately, they're eerily fluent, but all seem to have the same minor speech impediment" (read: messed-up weights in a commonly used speech model).

bityard
1 replies
23h5m

all seem to have the same minor speech impediment

Ah, that is called an accent.

dontupvoteme
0 replies
21h49m

Kind of. Accents are typically derived from the intersection of natural languages, specifically which ones you learned the phonetics of first. (With the exception of the Mid-Atlantic accent...)

This would be something quite novel, as the speech irregularities would not have their origin in people.

I don't know what you would call it, but it needs at least some adjective before "accent" to differentiate it, IMO.

siraben
0 replies
23h6m

I've been using ChatGPT 4 to translate and explain various texts in Mandarin and it's been very on point (checking with native speakers from time to time, or internet searches). As expected, it has trouble with slang and cross-language loanwords from time to time. However, for languages with much less information online, it hallucinates like crazy.

coming soon: "We've been getting a lot of these tourists here lately, they're eerily fluent, but all seem to have the same minor speech impediment"

Haha, if that were to pass, that would still be a far better outcome than our current situation of completely blind machine translation (especially for various Asian languages that are very sensitive to phrasing) and mispronunciation by non-native speakers.

modeless
2 replies
1d

The first one I plan to try is https://github.com/netease-youdao/EmotiVoice

I don't have the expertise to judge the quality of Mandarin pronunciation myself, being a beginner. But it sounds OK in English and it's made by native Mandarin speakers in China so I expect that it sounds better in Mandarin than English.

siraben
1 replies
18h30m

Sounds pretty good, although still lacking in natural-sounding tone sandhi (e.g. try 一下, it should be yi2xia4 instead of yi1xia4).

999900000999
0 replies
13h40m

Do you have a favorite Chinese learning app ?

rnjesus
0 replies
20h9m

the azure neural tts voices in chinese are the best i’ve heard, specifically the “xiaochen” voice. i use it in anki daily to generate sentences for my mandarin decks with an api key/plugin. it’s not something you run locally of course, but they have a decent enough free tier.

i’m hoping a voice as realistic as this becomes a local app soon, but i’ve not found anything that’s nearly as natural sounding yet. (also, honorable mention to chatgpt’s “sky.” she pronounces mandarin with a funnily american accent, but it sounds natural and not as robotic as the open-source alternatives i’ve tried)
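
for reference, generating a clip with that voice looks roughly like this with the Azure Speech SDK for Python (azure-cognitiveservices-speech); the key, region, and output filename below are placeholders, not my actual setup:

    # Sketch: synthesize a Mandarin sentence with the "xiaochen" neural voice.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY",
                                           region="YOUR_REGION")
    speech_config.speech_synthesis_voice_name = "zh-CN-XiaochenNeural"

    # write the audio to a wav file that an Anki plugin (or anything else) can use
    audio_config = speechsdk.audio.AudioOutputConfig(filename="sentence.wav")
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
                                              audio_config=audio_config)

    result = synthesizer.speak_text_async("我们一起去吃早午餐吧。").get()
    print(result.reason)  # expect SynthesizingAudioCompleted on success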

meowtimemania
7 replies
1d

There are already a few of them. Check out https://hallo.ai

999900000999
6 replies
1d

I wouldn't feel good about anything that's not focused on a single language.

You end up with the Duolingo problem where you know to say the names of 20 different fruits but not how to introduce yourself.

coldtea
1 replies
1d

Never seen that in Duolingo. It starts with the basics and phrases, not random useless vocabulary.

cptskippy
0 replies
23h16m

I was going to Italy and started using Duolingo to try and help. I learned such useful phrases as "the children have bread".

numpad0
0 replies
23h10m

(The Duolingo problem, as I understand it: Duolingo is designed around the premise that, by exposing your subconscious to a small set of words and phrases in the target language, your brain should be able to trivially construct output shims from Universal Grammar, which must exist, to the desired language; but that doesn't work in practice, and you end up with just the small set of words and phrases your subconscious has recorded.)

massimokris
0 replies
21h10m

Duolingo's problem is not that they have a bunch of languages; it's that achieving fluency in a target language is about being able to produce/generate phrases, and they just have you consume and sort words and phrases. With any AI language tutor, the student must produce phrases in order to practice, and that moves them along the path to fluency.

gs17
0 replies
23h20m

Duo has a different problem for me. The lack of focus means some languages don't get features. Chinese still doesn't have Stories (there's an unofficial version of it, but we've been waiting years).

apwell23
0 replies
1d

You end up with the Duolingo problem where you know to say the names of 20 different fruits but not how to introduce yourself.

Not sure if this is a Duolingo problem. There are modules in Duolingo specifically for saying your name. I think it's the travel module.

spaceywilly
4 replies
1d

To me the key functionally for any language learning app is giving you feedback on your pronunciation and general understanding. I’ve been using Duolingo to learn Mandarin and when I try to speak to anyone it’s difficult for them to understand me, because my pronunciation is all wrong. The app is just feeding info to me one way, and I can try my best to recreate what I’m hearing, but there’s no way to know if I’m messing it up. They do have a speaking feature but it doesn’t work very well, certainly not to the same level as speaking with a real person who is fluent in the language and having them correct you.

throwaway4aday
0 replies
23h8m

As a quick solution, you should try recording yourself speaking and then listen to it to check your pronunciation against some reference. So for example, find a YouTube video in the language you're learning that also has good subtitles (use https://filmot.com/ ) and listen to how they say the phrase and then record yourself saying the same phrase and play it back and compare.

kccqzy
0 replies
10h57m

It's the same struggle language learners have faced for a long time, app or no app. I studied French grammar carefully, read French books, listened to audio tapes in French, and I still got pronunciation wrong often.

I resisted using Duolingo because I knew their speaking feature sucks. But the only reason I need an app rather than books or audio tapes is that I need something to correct my pronunciation.

dog321
0 replies
19h40m

I practiced for a long time using the below pronunciation trainer and I get a ton of compliments from native speakers on how accurate my pronunciation is.

https://fluent-forever.com/product/fluent-forever-pronunciat...

addandsubtract
0 replies
7h51m

There are other language learning apps, such as Busuu, which make you record and peer-review other people's pronunciations.

bilsbie
3 replies
1d

I think it would be so ironic if advanced AI ended up simply teaching us new languages quickly instead of translating for us.

toomuchtodo
1 replies
23h50m

Might be able to generate a better language than what we have.

bilsbie
0 replies
22h23m

Good point. Maybe they invent a better language and easily teach it to everyone.

dontupvoteme
0 replies
21h49m

Finally Esperanto has a use case!

advaith08
2 replies
1d

seen a lot of these, but none for Indian languages. Would love to try an Indian language one!

999900000999
1 replies
23h16m

Are Indian languages hard for English speakers?

thinkingtoilet
0 replies
20h28m

I'm learning Hindi and there are some things that are easy (phonetic alphabet, nothing like 7 different sounds for 'ough'), but the sentence structure is very different and can be hard to get right. Pronunciation isn't too bad for the most part, but there are a few tricky things, for example four different 't' sounds and four different 'd' sounds. The hardest part is that there really aren't that many resources. Even though Hindi is the third most spoken language in the world, you will find far more resources for many of the less spoken European languages.

flanbiscuit
1 replies
21h19m

I would love a game that helped you learn a language (not necessarily VR though as I don't have that equipment). The game drops you into a world (a country of the language the game is meant to teach you) where no one speaks your language and you have to figure out what people are saying in order to fulfill quests. You get some hints, like maybe you have a simple translation guide in your inventory or sometimes you meet people who can speak a few words of your language. That would motivate me to learn faster than self-taught tutorials.

I'd love to learn French and the game would take place in locations all around modern France.

It would have to have a good story. Maybe something in the style of the Professor Layton series could be interesting, or something more open world.

resonious
0 replies
2h41m

If Professor Layton itself has a French translation then you're more than half of the way there! Existing games are already quite good for language learning. But indeed they're missing the "realistic" element that you're after.

zbyforgotp
0 replies
19h0m

But will people use them?

tmountain
0 replies
1d

Started a project to do this a while back. It's pretty fleshed out:

https://www.parcero.ai/

I could integrate this instead of Polly pretty easily.

massimokris
0 replies
21h19m

I built one for people in Latam to practice languages in a conversational way through a WhatsApp chat https://wa.me/+5491162951713?text=hola%20Speakeasy

jbird11
0 replies
22h45m

Absolutely, what I've noticed is that the current apps are great for beginners but after a certain point the only way to improve your ability to speak a new language is to well... speak it. I built Proseable to help people move beyond the generic how to order a coffee or ask to go to the bathroom, and have more meaningful conversations in the real world. Check it out!

https://www.proseable.com/

inbread
0 replies
22h56m

I built just this a month ago with the Azure AI speech API, which is already pretty good at multilingual speech.

https://github.com/adrianmfi/gpt-tutor

I look forward to testing if switching to Seamless can improve it further, Seamless supporting nearly 100 languages is a nice improvement.

dwighttk
0 replies
20h28m

and the language tutor company could have you pilot around a menial labor droid while you are learning...

dontupvoteme
0 replies
22h8m

For Language Acquisition, Input Is All You Need. (Mostly)

What would be really cool is something that can autodub videos or audio into your target language. The hardest problem learning languages that aren't English is often finding content to consume in them.

Disclaimer : I am Krashenist so this take is biased

Jeff_Brown
0 replies
22h10m

game

Yes! Better yet, you're a spy, or a hostage negotiator, or the leader of any kind of enterprise (army, business, aid organization) ...

Programming games like that will resemble directing improv theater. You can't program every response; you'll have to instead fit each character with beliefs and motivations.

I can hardly wait.

navbaker
34 replies
1d

Seamless Streaming looks really promising! We just had a new employee start a few months back with profound hearing loss, and our company had no idea what to do with him from an accessibility standpoint. They floated solutions like Dragon, not realizing those solutions are not real-time.

He ended up rolling his own solution by standing up Whisper in one of our clusters and writing a basic front end and API to take his laptop’s mic input and chunk it every few seconds to send to the model and get back text in pseudo-realtime. We got him a pretty beefy Alienware so he wouldn’t be tied to the cluster GPUs. I can’t wait to see what he does with these new models!
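
For anyone wanting to build something similar, here's a rough sketch of that kind of chunked, pseudo-realtime client in Python. The server URL, the /transcribe endpoint, and the chunk length are hypothetical placeholders, not his actual setup:

    # Sketch: capture mic audio in short chunks and POST each chunk to a
    # Whisper-backed transcription endpoint, printing text as it comes back.
    import io
    import wave

    import requests
    import sounddevice as sd

    SAMPLE_RATE = 16000    # Whisper models expect 16 kHz mono audio
    CHUNK_SECONDS = 5      # latency vs. context trade-off
    SERVER_URL = "http://whisper-host:8000/transcribe"  # placeholder endpoint

    while True:
        # record one chunk from the default microphone (blocks until done)
        audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="int16")
        sd.wait()

        # wrap the raw samples in an in-memory WAV container
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)          # 16-bit samples
            wav.setframerate(SAMPLE_RATE)
            wav.writeframes(audio.tobytes())

        # ship the chunk off and print whatever text the server returns
        resp = requests.post(SERVER_URL,
                             files={"file": ("chunk.wav", buf.getvalue())})
        print(resp.json().get("text", ""), flush=True)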

cgb223
10 replies
23h39m

Just wanted to say you’re a great employer to be so incredibly accommodating to the point you get them an Alienware and let them roll an accessibility solution

We need more support for employees like this!

cced
5 replies
23h37m

Second this!

Also, what about Apple’s latest M3 series chips? Are these in the same realm as Alienware in terms of AI compute?

jackson1442
2 replies
23h25m

I think generally the consensus of Apple Silicon is that they're great _for a laptop_, but still aren't going to beat a dedicated graphics card + high-end CPU like i9/Ryzen 9. Biggest thing going for apple is the performance/watt though which is critical for a laptop.

cjbprime
1 replies
22h38m

I think this is missing the main reason to use Apple Silicon, which is that your dedicated graphics card probably has 24GB or less of RAM, whereas e.g. an M2 Ultra Mac Studio can have 192GB of RAM with a far superior memory bandwidth to anything on x86. This is important because even a "small" LLM like Llama2 13B would require quantization to fit in the 24GB RAM that the dedicated graphics card will give you, whereas the Mac could run Llama2 70B without quantization (at FP16).

aftbit
0 replies
22h2m

Whisper doesn't need that much RAM though.

willy_k
0 replies
23h17m

They definitely are in terms of energy efficiency

nodja
0 replies
22h55m

They're better than most consumer x86 CPUs but worse than using a GPU. Where they shine is when the ML model can't fit in the GPU's VRAM, since you have better options for RAM size with Macs.

romwell
3 replies
22h26m

Just wanted to say you’re a great employer to be so incredibly accommodating to the point you get them an Alienware

So gracious, to give a software developer some hardware to run the software they need to work, that costs a whopping nothing more than what other people in the industry get on the average.

and let them roll an accessibility solution

"You're such a good employer! You let your employee build their own accessibility ramp to the back entrance in their own time, and even got them a mortar spatula to do so!" We need more support for employees like this!

We need more support for employees like this!

And less support for employers like this.

Solvency
2 replies
21h3m

Not sure why you're being downvoted. Literally the equivalent of building your own ramp.

freedomben
1 replies
19h23m

I didn't downvote, but I considered doing so because nowhere that I saw in GP does it say in his own time, and that's a critical piece of the equation. Hallucinating that datum means they got the argument wrong, and worse they were harshly critical of the company based on that wrongly assumed information.

It reminds me of the Homer Simpson quote, "I don’t mind being called a liar when I’m lying, or about to lie, or just finished lying, but NOT WHEN I’M TELLING THE TRUTH!" I would be equally critical if it was warranted, but when it isn't it's deeply unfair to the accused.

If the person wanted to build their own ramp, and the employer let them do it on the clock, that's a completely different scenario than the employee having to come in during their off-hours to build the ramp just so they can go to work.

navbaker
0 replies
14h33m

Yeah, it wasn’t on his own time. He had a full budget and this was right in line with stuff he had already done research in anyway, so he just went for it.

qkeast
9 replies
23h31m

Awesome! I love hearing about places making the effort to be inclusive.

As someone who’s profoundly deaf myself, another less technical approach is to install Rogue Amoeba’s Loopback, and use it to pipe audio from a given app into a tool like Google Meet or Otter.ai using the Loopback device as the audio source. This effectively provides real time captions for anything running on your existing machine.

romwell
3 replies
22h23m

Awesome! I love hearing about places making the effort to be inclusive.

The extent of the effort being getting their employee a slightly-more-expensive-than-average tool that would enable them to do their job better regardless of the disability?

Such inclusive, much pat-yourself-on-the-back, wow.

"We gave our woodworking shop employee a quality saw so that they'd make their own accessibility ramps!"

qkeast
0 replies
21h44m

I have literally been told in job interviews that the company would not be “allowed” to hire me because I’m hearing impaired, so yes, making an effort to support an employee’s disability and their needs is worth recognizing.

callalex
0 replies
21h45m

What would you have them do instead?

RogerL
0 replies
19h41m

So what? Okay, in the case of a ramp, if you need one you probably are going to have difficulty building one. So pay employee Sally to build it instead, absolutely.

But hearing loss does not impair standing up servers and software. They can pay the employee who probably is the expert at this, the guy with the hearing loss, or go task Emil to go do it to ... avoid 'appearances'?

navbaker
1 replies
14h28m

We definitely explored using these tools, but we’re constrained by government sponsor rules regarding data protection in our day to day work. We can use ZoomGov captions, but most of the other tools weren’t approved. It looks like Windows 11 has a real time solution of some kind, but we’re still stuck on 10.

sdrothrock
0 replies
10h26m

I have profound hearing loss and rely on the Windows 11 captions. They are absolutely best in class and head/shoulders above any other automatic captioning I've used. My coworkers have a variety of accents (Hispanic, Eastern European, South African) and it does a great job with all of them.

Additionally, it supports multiple languages (only one at a time sadly), so I also use it for Japanese captions and it's equally great there.

tuukkah
0 replies
22h54m

Clever use of Google Meet as a tool! Also, Google Pixel phones now provide realtime captions to any speech playing on the phone (Accessibility > Live Caption). You can also choose a "preferred language" and the captions will be automatically translated to that language from other languages.

jallmann
0 replies
22h37m

Google Chrome [1] also has captioning built in [2], so this could also work from a plain page that hooks into the loopback device. Pretty sure it's using the same speech-to-text backend that Google Meet uses.

The nice thing about Chrome feature is you can move the caption box around and keep it in the foreground while doing other things, although styling options seem limited (the text might be a little small for some).

[1] on desktop, not sure about mobile

[2] via chrome://settings/accessibility -> Live Caption

fy20
0 replies
8h10m

Whisper is pretty good for speech to text, and can be run in a resource-constrained environment. I tried a demo running in a browser using WASM on my phone and even the tiny model is not bad.

pawelduda
2 replies
22h50m

That's very nice of you

romwell
1 replies
22h22m

He ended up rolling his own solution

That's very nice of you

...doesn't compute.

What exactly was nice here?

diab0lic
0 replies
18h36m

We got him a pretty beefy Alienware so he wouldn’t be tied to the cluster GPUs.

Probably this.

FloatArtifact
2 replies
21h56m

The problem with Whisper is that it's not really optimized for command recognition versus general dictation.

- Whisper processes 30-second audio chunks. So if you process 5 seconds of audio, you have to pad it out with 25 seconds of silence. Hence a loss of efficiency, with CPU/GPU cycles wasted on 25 seconds of padding per chunk in the case above.

- Whisper most likely can't handle hundreds of commands, much less a thousand, performantly.

- Whisper doesn't handle short commands very well, nor does it post-process commands out of free dictation utterances with a good degree of accuracy.

Command dictation should be weighted higher than general dictation when decoding.

I work with a little under 1,500 commands in Dragon NaturallySpeaking. DNS is hot garbage as a program, despite having the best accuracy to date and the ability to handle commands and dictation in one utterance. You get to pay $750 for the privilege.

I've yet to see a free and open source speech recognition engine that can handle both dictation and commands with a high degree of accuracy.

Please please let me know if there's alternatives out there. I would definitely pay to support an open source project like this that focuses on command and dictation.

Most open source solutions nowadays focus so much on IoT command recognition with intents. That's not well suited for controlling your computer with grammars containing voice commands.

novok
1 replies
20h48m

Is 30s the input size set by the model, or programs that wrap the model? Is it how it's trained?

bakkoting
0 replies
18h50m

It's a property of the model itself.

Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder.

https://openai.com/research/whisper
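
That 30-second framing shows up directly in the openai-whisper package's lower-level API; a minimal example, essentially the snippet from the project README:

    import whisper

    model = whisper.load_model("base")

    # load audio, then pad with silence or trim so it is exactly 30 seconds
    audio = whisper.load_audio("clip.wav")
    audio = whisper.pad_or_trim(audio)

    # compute the log-Mel spectrogram the encoder consumes
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # decode that single 30-second window into text
    result = whisper.decode(model, mel, whisper.DecodingOptions())
    print(result.text)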

lovich
1 replies
22h41m

Y’all should turn that into a product, or at least open source it and get the positive PR + helping others

FloatArtifact
0 replies
21h6m

Y’all should turn that into a product, or at least open source it and get the positive PR + helping others

There you go. https://github.com/dictation-toolbox/dragonfly

kylixz
1 replies
22h22m

I recommend checking out: https://talonvoice.com/

FloatArtifact
0 replies
21h23m

It's not open source nor does the author intend to open the stack.

aftbit
1 replies
22h2m

Check out Willow! It does essentially this, using WebRTC. It doesn't handle the near-real-time response yet, but it does stream the audio to the server and the change would be pretty minor.

FloatArtifact
0 replies
21h11m

Check out Willow! It does essentially this, using WebRTC. It doesn't handle the near-real-time response yet, but it does stream the audio to the server and the change would be pretty minor.

Simple voice-to-text is not what's needed for dictating commands. It might be useful if I could load commands on the fly and decode utterances against them.

The client would need to be able to send its commands to the server on the fly.

sagz
0 replies
20h30m

Do they need realtime transcription?

Computer: webcaptioner.com
Android: Live Transcribe (g.co/livetranscribe)
iOS: Live Caption with the 'mic' icon enabled.

Web conferencing: Meet, Zoom, Teams all support realtime CC, which is pretty good.

londons_explore
28 replies
1d

Does "reduce toxic words" and "promoting safer communication" mean that if you say something wrong about LGBTQIA+ people it will 'correct' what you say?

I'm not sure I want the latest twitter trend to be involved in the design of my translator...

mortimerp9
13 replies
23h49m

Hi, I work on Seamless. What this refers to is added toxicity mitigation. We try to detect the level of toxicity in the input and make sure that the output toxicity level is not higher. This protects the model from making egregious errors in the translation.

There are more details in the paper, and the mitigation code is all open source if you want to check what it actually does.
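
As a toy sketch of the core idea only (this is not our actual implementation, which uses curated per-language toxicity word lists and searches among candidate translations):

    # Toy illustration of the idea (not the Seamless mitigation code):
    # never emit a translation that is more toxic than its source.
    def toxicity(text: str, toxic_words: set[str]) -> int:
        """Count words from a per-language toxicity list appearing in text."""
        return sum(1 for w in text.lower().split() if w in toxic_words)

    def pick_translation(source: str, candidates: list[str],
                         src_list: set[str], tgt_list: set[str]) -> str:
        src_tox = toxicity(source, src_list)
        # keep candidates that add no toxicity relative to the source
        safe = [c for c in candidates if toxicity(c, tgt_list) <= src_tox]
        # fall back to the top-ranked candidate if nothing qualifies
        return safe[0] if safe else candidates[0]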

Reubend
4 replies
23h30m

That's an awesome feature. I think one of the worst possible outcomes of machine translation is something that ends up being accidentally offensive, and this is a smart way to mitigate that.

SoftTalker
2 replies
20h46m

Or maybe we'll finally come around to the idea that being offended by words doesn't make a lot of sense.

madeofpalk
0 replies
1h6m

I'm sure you can understand why translating "I love you" to "I love you, bitch" is probably undesirable.

hiatus
0 replies
6h12m

This will happen at the same time we stop being uplifted by words, or moved by them, or brought to tears by them, or fall in love over them.

fl7305
0 replies
22h23m

one of the worst possible outcomes of machine translation is something that ends up being accidentally offensive

The Hitchhiker's Guide To The Galaxy claims the opposite:

"Meanwhile, the poor Babel fish, by effectively removing all barriers to communication between different races and cultures, has caused more and bloodier wars than anything else in the history of creation."

dontupvoteme
2 replies
21h19m

How do you account for colloquial (non-English) language which could be naively misconstrued as toxic?

e.g. "geil" (either cool or horny depending on usage) in German

It's not fundamentally different than e.g. "wicked" in English, but the biggest bias that potentially all these ML models exhibit is predisposition towards Anglophoneism

mortimerp9
1 replies
20h46m

Our goal is to have good recall, sometimes to the detriment of precision, so a word with multiple meanings might be considered toxic even when, in the actual context it is used in, it is not. The toxicity mitigation algorithm will search for alternative translations that have the correct meaning but not the potentially toxic word, so that there is no added toxicity in the output. This means that sometimes the model might prefer a less colloquial phrasing than what a human would use.

You can find details on how the multi-language toxicity lists were created in section 7.3 of the NLLB paper: https://arxiv.org/pdf/2207.04672.pdf. TLDR: it's not just a translation of a base English list, even if we started from that; each language has a curated list that was built by professional translators.

dontupvoteme
0 replies
20h8m

That's significantly less myopic than I pessimistically assumed. Thanks!

novok
1 replies
20h45m

Is there an ability to turn it off? If you're translating an R rated movie with criminals who swear a lot, is it possible to get non-toxic filtered output to make sure it's being translated properly?

mortimerp9
0 replies
20h39m

It only kicks in if the output is more "toxic" than the input. If the input has a lot of swear words and the output has the same amount, then it will be left alone.

Domenic_S
1 replies
23h41m

What this refers to is added toxicity mitigation.

Oh, well that clears it up! </snark>

I don't see any definition of 'toxicity' on the landing page - it seems to be one of those 'I know it when I (hear) it' kind of words... unless there's some widely-accepted definition in this area of study?

mortimerp9
0 replies
20h41m

Sorry if I wasn't clear, internally we've been talking about it a lot, but I forgot that it doesn't have such a solid definition outside of our work. Thankfully, we try to define it in section 7.3 of the NLLB paper: https://arxiv.org/pdf/2207.04672.pdf

The tl;dr is that if you say "Thank you for this job offer.", you wouldn't want it to be (mis)translated as "Go F*k yourself." But if you do say "Go F yourself", you still want it to be translated as that.

thomastjeffery
0 replies
12h10m

What about the inverse?

Can it make sure that the output toxicity level is not lower than the input?

If not (which I strongly suspect is the case), then that is unacceptable. We cannot fight toxic narratives with ignorance.

jadbox
4 replies
23h28m

Your comment seems to imply LGBTQIA+ is just a Twitter trend, versus people's lived experience and lifelong identity. This is as unnecessarily judgmental as small identities claiming that straight people must self-identify as cis.

There is no moral superiority to deny or force label other people's identities. You're an attack helicopter? Great, roger dodger, let's go get coffee Seahawk.

No one is seriously asking for litter boxes in school bathrooms or helicopter refueling stations.

mpalmer
2 replies
23h11m

No one is seriously asking for litter boxes in school bathrooms or helicopter refueling stations.

This feels a bit out-of-nowhere.

My read on parent comment was that "Twitter trends" are fast-changing norms about what language is (un)acceptable. They were not saying that LGBTQIA+ identity itself is a trend.

jadbox
1 replies
23h2m

Perhaps so. In light of yesterday's announcement from Russia labeling the "international LGBT public movement" as extremist, I think we should be careful about what we label as fads or (worse) insidious activity. Source: https://www.themoscowtimes.com/2023/11/30/russia-bans-intern...

mpalmer
0 replies
22h35m

You seem to me to be arguing against points no one is making. You're taking the word "trend" and extrapolating it to "fad" and "insidious activity" - both of which have very different meanings and connotations to the phrase "Twitter trend".

The original comment you replied to made the point that they don't want their own personal expression curtailed or modified according to someone else's opinion of acceptable speech.

As someone who repudiates Russia's policies, I support and agree with their point.

thomastjeffery
0 replies
12h15m

Thinking more directly about the subject in hand, what if we took their comment text as an example, and input it into a "responsible translation model"?

Taking what they wrote as harshly as possible, a translation model's output might include narrative elements from the transphobic judgement you are concerned about. That would be a problem, because it would amplify transphobic narratives.

Taking what they wrote as favorably as possible, a translation model's output might rephrase what was written, such that a pro-LGBTQIA+ inclusion narrative is more eloquently expressed than the author actually intended. That would be a problem, because hiding the reality of transphobic narratives would remove our ability to recognize and talk about them.

To make this even more complicated, what if we are using this model for real-time dialogue? What happens when someone says something vaguely transphobic, their words get translated to an inclusive narrative, and you continue that inclusive narrative in your reply? Should the translator alter your words to be transphobic? If it doesn't, then will the entire conversation go off the rails, or will both parties continue, oblivious of each others' ideological subtleties?

---

I don't believe for a second that a model could be trained to avoid toxic narrative and translate accurately.

Hallucination is a feature, not a limitation. The sooner "AI" narratives can accept this reality, the better.

beardicus
3 replies
23h46m

the site makes it pretty clear in multiple places that they're talking about "added" or "hallucinated" toxicity. maybe your culture war outrage is misplaced?

Domenic_S
2 replies
23h37m

Ok so I know nothing about how this works. It seems like if the model was able to properly detect words in the first place, it would never hallucinate 'toxicity'; if it can't recognize the word with high probability, how will it know whether the speaker actually said $toxicWord or whether it should print something else?

Perhaps it's taking a Big List of Naughty Words and weighting them so that the system must be "extra sure" that's what the speaker said, or else fall back to a G-rated word?

numpad0
0 replies
22h42m

Maybe it's for preventing unwarranted fucks[1]? Translation is more than just concatenating dictionary definitions, and machine translations routinely make this kind of out-of-place and technically correct lookups.

1: https://www.google.com/search?q=engrish+fucking+sign&tbm=isc...

mortimerp9
0 replies
20h36m

Meta employee here. The system is not perfect, or it would not "hallucinate". While it's pretty good, it does sometimes make errors (not just hallucinations, but also mistranslations due to noise in the training data). What we want is to avoid these errors introducing toxicity (think swear words) that wasn't in the input, as this could be very bad for the user. There is a separate system that double-checks the output (compared to the input) and tells the translation model to try again if it's too bad.

jwineinger
2 replies
23h55m

Their video said it was to reduce toxic word hallucinations, which does seem admirable/useful. I'm testing real-time translation in a church setting, and I've witnessed whisper hallucinating profanity, which is quite undesirable.

kelseyfrog
0 replies
23h32m

It also happens to be quite hilarious.

cgb223
0 replies
23h38m

“Toxic word hallucination” would be a great punk rock band name

sjbase
0 replies
22h54m

Please don't use Hacker News for political or ideological battle. That tramples curiosity.

From the hackernews guidelines

madeofpalk
0 replies
23h37m

Your framing of basic respect as being a "Twitter trend" is... bizarre.

nickreese
19 replies
1d1h

My wife was training to be a professional voice actor to do dubbing in several languages when we met.

I told her then that the industry would be disrupted by AI before she retired.

Glad she pivoted. Really impressive results.

0_____0
12 replies
1d

It won't replace high-end talent; I don't think models will be able to replicate the nuance for a long time. However, the entire low-to-mid end of the market is going to get nuked from low Earth orbit.

crakenzak
5 replies
23h48m

It will absolutely replace high-end talent. Anything that a human can do will be able to be done 10x better by a model -- especially in such a narrow and well defined domain.

sushisource
4 replies
23h38m

Did you hear the output examples? Yeah, I think not. It's definitely on the way, but if you need quality acting in your dub, there's no way you're going with this.

ygjb
0 replies
21h57m

These are models specially tuned and sized for near real-time, instant translation. It would be naive to think that there aren't technical creatives building and training models tuned for expressiveness and nuance in a more controlled environment.

dvngnt_
0 replies
21h42m

i think the key word is will.

a few more years of improvements if they happen could be disruptive

dontupvoteme
0 replies
21h35m

That's what they gave us plebs. To think they don't have a superior one they can sell...

crakenzak
0 replies
21h56m

Maybe not in the current state of the model, but judging by the rate of improvement we’re all seeing it’s just a matter of time (and data+compute+research obv).

chrismorgan
2 replies
21h32m

It won’t replace it, but it’s very likely to supplant it, just about destroying the segment by reducing demand by being good enough and so much cheaper, especially as people get more used to it.

Typesetting. Music engraving. Bookbinding. The quality of all these fields has been materially harmed by advancements.

Computer typesetting has, by and large, been a significant regression, though the gap has largely been made up now if you make the right choices.

Published music scores used to be set by experts. Now they’re set by novices using software that is mechanical in method and generally quite insipid. Most are atrocious compared to the old masters, and mediocre at best compared to the typical published scores from a hundred years ago; and very few popular scores are really good (… and if they are, there’s a reasonably high chance they’ve used GNU LilyPond, which has focused on this problem). But the barrier for entry is so much lower, and people have got used to the inferior results, so I don’t know if anyone engraves music the old way, and even people that know better largely just shrug and make do with the new. Like with computer typesetting, there is hope because things have slowly improved. But most will continue to be mediocre.

Books used to be bound with cold glue. It takes time to set, but the results are very good, supple and long-lasting. Then along came hot-melt glue, and it’s just so much friendlier for cheap manufacturing because books are finished within a few minutes instead of a day or two, that I don’t think anyone produces books the old way any more, even though the results are abysmal in comparison (compare the binding and reading experience of a paperback from the ’40s or ’50s with one from the turn of the century; no one after tasting the old will desire the new; for he says, the old is good). But they’re just (barely) good enough. Unlike the other two, I don’t think there’s any hope here—the regressive advancement crowded out the superior but dearer option so that no place was found for it.

pclmulqdq
1 replies
19h56m

You can still get relatively good published music scores from a few of the old German shops (Schirmer, Henle, etc.), but they are very expensive. They are a joy to use when playing, though, since the music is very clearly laid out and page turns are in the perfect place, etc. Finale and Sibelius are controllable enough that you can use them to do fantastic layout, but many people either do not understand how to make a score readable or don't care enough.

TeMPOraL
0 replies
19h33m

That, and what GP describes, is what I see as the overall trend of the market to hollow out the middle. It's not just about technology (though it plays a big role); it's all the optimization that comes from competitive pressure: materials, processes, business models, marketing.

What seems to universally happen is that the market bifurcates: one part is in a race to the bottom, the other (much smaller) aims for the super premium tier (overpriced quality), because only those two positions are sustainable once the race-to-the-bottom side drags all the economies of scale with it. So as a consumer, you get to choose between cheap low-quality garbage that's barely fit for purpose, and rare, super-expensive, professional/elite high-end products. There is no option for "good value for a reasonable price".

This has been happening to everything - software, furniture, construction, electronics, vehicles, food, you name it.

Shish2k
1 replies
1d

I wonder which will happen first - AI evolves to work well at the high-end, or high-end humans retire and there’s nobody left in the low-to-mid end to fill their shoes…

callalex
0 replies
21h41m

Given the modern trend of on-screen actors doing voice work, I think there will be a supply of talent for at least a few more generations.

RowanH
0 replies
21h31m

I'm using AI for training videos for my startup. Never going back to voice actors outside of primary marketing videos. The sheer convenience of the write/listen/tweak cycle on scripts is insane. In minutes you can do a voiceover that would previously have taken hours of work plus days of delay.

Sure, the final result sounds slightly robotic. 99% of people wouldn't care, and you can get more training videos done, faster, for a fraction of the cost.

[Edit] And I'll add that the difference from 6 months ago to today is noticeable. I imagine every 6 months we can just re-download updated voiceovers, and each time they'll sound just slightly more polished.

ggregoire
3 replies
23h16m

I told her then that the industry would be disrupted by AI before she retired.

Yes. I just discovered there is a text-to-speech addon [1] (now a few months old) for World of Warcraft that adds voices for every NPC in the game... It is so impressive and such a game changer (pun intended) that I naively asked in the chat of the Twitch stream I was watching "when did Blizzard add voices to the NPCs??". For an instant I really thought Blizzard had contracted actors, but no, someone like you and me just used AI to generate realistic voices for every character in the game. I don't think it's ready yet to completely replace actors in video games (surely it will in the near future tho), but voice acting is so expensive to do that I can see studios and developers in 2024 already using this tech for all the optional dialogues and secondary characters' voices.

[1] https://www.curseforge.com/wow/addons/voiceover

freedomben
1 replies
19h17m

I've wondered at what point this would happen. I think it could now, but from what I've read the voice actor unions are able to prevent it currently (at least for AAA games or non-indie devs). Many of them have agreements/contracts in place for the foreseeable future, and being the first big company to replace them would bring a heap of terrible press that nobody is going to want to touch. I think it's the same reason Hollywood reached the AI agreement recently too.

GaggiX
0 replies
10h30m

I think that there are a lot of voice actors that are not unionized tho. And games like The Finals already use AI for many voices.

lyu07282
0 replies
20h37m

Another recent example: The Finals uses AI voice generation for real-time game announcements.

https://youtu.be/kZ87wiHps9s

ilaksh
0 replies
20h6m

What did she pivot to? I don't think any currently existing job is really safe in the medium-to-long term.

Halong
0 replies
22h58m

My wife is paying our mortgage teaching English on Preply. I'm extremely worried about where we'll be in 10 years.

coffeebeqn
14 replies
1d1h

We can't be that far off from almost perfect real-time translation. There is some latency, of course, to hear and process.

mrob
11 replies
1d1h

Differences in verb-subject-object word order will always add latency. If you want to translate from German, with the verb at the end, to Welsh, where the verb goes at the start, you'll have to wait for the complete sentence before you can begin.

tralarpa
5 replies
1d

It's very impressive what simultaneous interpreters can do. They don't wait for the end of the sentence.

numpad0
3 replies
1d

Yeah they backtrack on branch prediction failures.

dylan604
2 replies
1d

What kind of heartbleed that must introduce.

Vecr
1 replies
23h34m

You mean meltdown/Spectre?

dylan604
0 replies
23h15m

probably, but you got the gist anyways

MrsPeaches
0 replies
1d

Even they struggle with jokes though.

This may be apocryphal, but I've heard that in formal settings (e.g. the UN) they won't translate it and will instead give instructions on when to laugh.

d3m0t3p
4 replies
1d

Not necessarily true. For the first few sentences you won't be able to do it, but afterwards, once the context is established, you don't really need to wait for the verb; you can predict it. For example, if you are speaking about cleaning the house and you detail that you have cleaned the kitchen, the stove and so on, you can predict the verb with only the start of the sentence. I don't have any source to back this up, but it sounds plausible.

gberger
1 replies
1d

What if the predicted verb was incorrect, but the model has already translated the incorrect prediction? How does it tell you about a mistake?

mrandish
0 replies
23h27m

A good approach might be to start with how top notch, ultra-experienced human translators handle corrections for real-time scenarios, for example, the expert translators that do the ear monitors at the United Nations. I've worked with a few such real-time translators when preparing keynote speeches and they seem to have rigorous processes that appeared quite deep. Probably a ton of domain expertise to be captured there.

That said, I suspect that real-time language translation is always going to be somewhat imperfect due to its nature. Non-real-time translation of literature is still a subjective art form even at the very high-end of human expertise.

shkkmo
0 replies
23h36m

Once you start predicting what someone is going to say, you are no longer translating their speech.

Teever
0 replies
19h58m

Yeah but then you're just introducing branch mispredictions which will cause latency and potential confusion down the line.

It's all a trade off.

Either way, it's extremely exciting that we get to even discuss this stuff as real possibilities.

Innervisio
1 replies
1d

Although true, and considering what "mrob" also replied, this will never mean full translation every time, all the time. It will work in specific environments and with specific linguistic expectations.

I’ve been learning german since 8 years, and the amount of expressions and different ways to say things around the country is impressive. There’ll be a “interpretative” real-time translation, but it won’t guarantee fully understanding in so many cases, maybe ever.

Another thing, and we have this in common with all languages, is context, and this is difficult to address, I believe.

Nevertheless, it's impressive how far we've come, and I acknowledge the usability of these tools. However, human knowledge will always be crucial and primordial if we want to guarantee full understanding.

InCityDreams
0 replies
1d

I’ve been learning german since 8 years,

"Since", as used here, would lead me to guess you are not a native English speaker?

fassssst
10 replies
1d

Try the demo here, you record a video of yourself and it does voice cloning and a comparison:

https://seamless.metademolab.com/expressive/?utm_source=meta...

ceejayoz
6 replies
1d

This research demo is not open to residents of, or those accessing the demo from, the States of Illinois or Texas.

Interesting mix.

aschla
2 replies
1d

Likely related to biometrics laws. I know Illinois has restrictions on the collection of biometrics, not sure about Texas. Facebook in particular paid out a significant amount of money in a class action in Illinois, I know because I got a chunk of change from it.

dylan604
1 replies
1d

By which you mean someone took a dime and carved off a piece of it, and then sent you a piece of paper with postage that cost more than the value of that chunk? Yeah, we all got hosed by that one too, I'd imagine.

ceejayoz
0 replies
1d

https://www.nbcchicago.com/news/local/illinois-facebook-user...

According to the Settlement Administrator, payments to class members between $200 to $400 started going in the mail May 9.

I got a $0.19 check from an iTunes settlement once, but this wasn't one of those cases.

solardev
1 replies
1d

Illinois has a facial recognition / cloud biometrics ban. Familiar face detection for doorbells etc. isn't allowed there. Wonder if Texas has something similar?

ceejayoz
0 replies
1d

Ah, that makes sense.

In Texas it seems to be part of AG Paxton's culture war stuff. https://www.texastribune.org/2022/05/12/texas-face-filters-i...

jlund-molfese
0 replies
1d

It’s because of https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004&C...

Facebook has had to pay out hundreds of millions of dollars in settlements for related class-action lawsuits, and rather than trying to get informed consent, they’re deciding not to collect biometrics from residents of those states.

wedn3sday
0 replies
23h48m

Well that was spectacularly bad. It failed to translate a single word from English to Spanish. Admittedly I was using George Carlin's favorites, but if you're trying to have an expressive language translator that refuses to translate "fuck", then what you've got is bullshit.

teacpde
0 replies
1d

As someone working in tech and following along with the progression of AI, I believe I have the right expectations. But it still feels surreal seeing myself speaking a foreign language in my own speech style.

SillyUsername
0 replies
1d

And that demo is now overloaded and fails to translate the input :D

jeffbee
5 replies
1d

How will Meta put these models into practice? I understand why Google and Apple have models for their mobile OS users, but I don't understand where users for Meta speech models come from. Are they planning to show Instagram videos with English narration in French or what?

polygamous_bat
1 replies
1d

Ads and Reels (their TikTok competitor) I imagine would be the primary use-case. Imagine spreading the "wonders" of TikTok-like videos to non-$native_language speaking world.

dylan604
0 replies
1d

but isn't that a TikTok shtick to use the obviously fake voice in your video?

spacemanspiff01
0 replies
23h36m

The metaverse will not have any language barriers...

solardev
0 replies
1d

Ads in any language!

crakenzak
0 replies
1d

They have arguably the most diverse userbase of any company, with users from pretty much every single country + language across all their services & apps. I could easily imagine a handful of use cases where having a high-performing universal translation model would be incredibly useful.

zengid
4 replies
23h57m

If "toxic word hallucinations" isn't a cyberpunk phrase I don't know what is.

(quote from the video presentation in the link)

spacephysics
0 replies
23h55m

Oh god they’re gonna censor the output. Time for musk to make a non-censored version lol…

drexlspivey
0 replies
22h10m

I am sorry Dave, "merde" is not in the pre-approved word list

dontupvoteme
0 replies
21h26m

I wonder if it doesn't understand the common colloquial usage of "geil" in German. This sounds like it is going to mess up natural language

albert_e
0 replies
17h35m

I thought he misspoke

I thought it was meant to be "toxic words, hallucinations, etc" in the script

ukuina
4 replies
1d1h

Next step is combining the output with few-sample speech synthesis so the output is in the original speaker's voice!

modeless
3 replies
1d

This does that already. At least, to a first approximation. Voice cloning is not that great in general right now.

coffeebeqn
1 replies
23h10m

Voice cloning works pretty well already, but not necessarily with one 10-second sample as the source data. If you can give it some hours of data, it'll work much better.

modeless
0 replies
20h52m

Do you have examples of it working well? I haven't heard anything that really impressed me. Nothing close to a good human impersonator. We're a long, long way from replacing voice actors, even considering the rapid rate of progress.

blovescoffee
0 replies
1d

The voice cloning worked pretty well for me. From English to Spanish, I noticed that the first few words sounded more like me than the last few. Also, it doesn't sound like how I speak in Spanish, but that's expected.

mkagenius
4 replies
21h44m

Yet again, Hindi (the major language in India) is not even in the samples. India is Facebook's largest user base (and probably a third of the engineers working there are Indian), but Facebook never puts enough effort into contributing back. It only uses the DAUs from India in investor calls.

cafed00d
3 replies
21h33m

By "samples" do you mean examples on the marketing/landing page? It sure looks like the model supports many major Indian languages like Telugu, Tamil & Kannada. https://huggingface.co/facebook/seamless-m4t-v2-large

Yeah, I kinda agree with the spirit of your comment; it sure would be nice to see a major Indian language like Telugu on their landing page for sure. But that's just my Indian-person bias speaking.

albert_e
1 replies
17h37m

Slightly new to this area: is there an example notebook that shows how we can use this model with our own sample audio and text? Thanks!

mortimerp9
0 replies
5h45m

I work on seamless and you can find sample code here: https://github.com/facebookresearch/seamless_communication or in the HuggingFace space.
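
If you'd rather stay in Python, a minimal sketch of speech-to-text translation via Hugging Face transformers (assuming a recent release with SeamlessM4Tv2 support; the file name and language code are placeholders) looks roughly like this:

    import torchaudio
    from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

    processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
    model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")

    # The model expects 16 kHz mono audio.
    waveform, sr = torchaudio.load("my_clip.wav")
    waveform = torchaudio.functional.resample(waveform.mean(dim=0), orig_freq=sr, new_freq=16_000)

    inputs = processor(audios=waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    tokens = model.generate(**inputs, tgt_lang="eng")
    print(processor.batch_decode(tokens, skip_special_tokens=True)[0])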

mkagenius
0 replies
21h24m

The lack of focus shows up in the results. The models never perform as well on Indian languages as they do on French or Spanish. This goes for Google, too.

ziptron
3 replies
1d

If you are multilingual but have young children and plan to continue residing in your current English-speaking country for the foreseeable future, are you opting to teach your children those additional languages, or are you adhering to the idea that they can always learn them later if necessary, considering it might not be essential (especially with models like this)?

esafak
2 replies
23h42m

It is easier to learn multiple languages when you are young.

robga
1 replies
23h29m

There isn't a lot of good evidence behind this popular conception.

If anything, the evidence is that it isn't true, see https://journals.plos.org/plosone/article?id=10.1371/journal...

Any apparent causality of age of acquisition seems to be a proxy for hours of exposure. It may well be that it is easier for young people to rack up a lot of exposure to a second language, but there's not much evidence that age plays much of a factor for people of different ages who had the same degree of exposure.

debugnik
0 replies
21h28m

we argue that the late learners resort to computationally less efficient processing strategies when confronted with (lexically determined) syntactic constructions different from the L1.

we show that the ERP signal in response to grammatical violations depends on the AoA of an L2 learner, as well as on the regularity of the structure under investigation. In (lexically determined) syntactic constructions different from the L1, we found a gradual change in processing strategies that varies by AoA, with a native-like effect for early learners and a less efficient neural processing strategy for later starters.

Although they do clarify that these effects could be confounded with age of acquisition instead of it being the cause.

yread
3 replies
22h54m

Does the Spanish expressive sample sound muffled for others too? And the French sounds super mechanical. Hopefully, it's more impressive the other way.

Also: "This research demo is not open to residents of, or those accessing the demo from, the States of Illinois or Texas"

grogenaut
0 replies
22h0m

Illinois is possibly because they don't allow storage of biometric data without express permission and I believe explicit usage restrictions. So I bet they're keeping all of your utterances, which would violate that law.

dentalperson
0 replies
22h37m

Yes, they all have significant 'ghosting' artifacts where the harmonics are a bit fuzzy if you listen closely. AFAIK all of the recent neural speech engines have this, from SoundStream to EnCodec, especially in low latency causal setups. Wavenet was a bit better in that regard but has fallen out of style due to complexity and the lack of a bottleneck. It seems like something diffusion post processing would be able to clean up.

TacticalCoder
0 replies
22h23m

The "expressive" example in french exhibits a thick accent which bothers me more than the mechanical aspect of the non-expressive french example.

It's not dissimilar to some kind of "ch'ti"/"chtimi" accent, or a Belgian French accent (which itself resembles the ch'ti accent heard in some parts of the north of France): "Ne partez pooooo" (with a longer "a" that sounds nearly like an 'o', which isn't proper French at all) instead of "Ne partez pas".

That said, I'll take the non-expressive voice any day over subtitles when watching video in a language I don't understand: it's clearly good enough.

wg0
3 replies
1d

And just the other day StyleTTS[0].

Just text to speech has gone too far. Audio books would be mainly generated on the fly like this?

I think some RPGs in some 5 years time might have something like this:

- A text file that outlines characters and a loose plot/storyline. Human written.

- 3D Mesh Generation based on character description via Transformers based models. Auto generated.

- Dialogues for each NPC via LLM.

- This TTS engine again based on such models.

Result - almost unlimited replayability. Or even edit the text file and have a new world based on a new storyline, with characters having different personas.

[0]. https://news.ycombinator.com/item?id=38335255

mpalmer
1 replies
23h8m

How has TTS gone too far?

wg0
0 replies
20h20m

Come a long way, that is. From the days of, if I recall correctly, the Windows 98 screen reader.

mortimerp9
0 replies
5h44m
infotainment
3 replies
1d1h

It’s amazing how far text to speech has come in the past few years, but what I’m wondering is when this tech will finally make it into local TTS engines baked into the OS (eg for screen readers, etc)

callalex
1 replies
21h43m

This is already built into recent iOS devices and it’s called Live Captions.

freedomben
0 replies
19h19m

Same with Android (Pixel phones at least).

I'm the most excited for an open source one though, and it would be incredible if this could become it. I do 95% of my compute on desktop linux and it sucks being behind.

PartiallyTyped
0 replies
1d

The accessibility nerd in me is excited!

whbrown
2 replies
1d

Can anyone help demystify the licensing?

Besides the ACCEPTABLE_USE_POLICY, there's a CC BY-NC 4.0 (NonCommercial) license, a 'SEAMLESS_LICENSE' (NonCommercial), but also an MIT license? It would seem these other licenses contradict the MIT license. Could somebody help clarify how these all interact in practice?

disattention
0 replies
1d

The license details are listed on the project GitHub

https://github.com/facebookresearch/seamless_communication#l...

dankle
0 replies
1d

MIT for the code, NonCommercial for the trained models I bet.

nathanfig
2 replies
22h24m

Impressive work, really excited for this.

I will note though that I feel safer getting an occasional bad word than I do having a translator straight up deceive me.

For example, "what the fuck" in English->Spanish is giving "qué diablos" output. Definitely toning down the meaning there.

If someone says something mean to me, I want to know it.

jonathanlb
1 replies
21h49m

This may be an intentional decision given that there are several ways to say "what the fuck" in Spanish, such as "qué mierda" or "qué carajos". And that's not including regional expressions like "qué coño" or "qué chingados". So, saying "qué diablos" may be the most common expression across dialects conveying the same meaning.

nathanfig
0 replies
21h43m

Yeah could be, I still need to read the paper to better understand the safety tuning.

Would be interesting to see some work stress-testing the ability to convey ill-intent across multiple languages. Accurately conveying ill-intent is safety-critical for the person being threatened.

anonzzzies
2 replies
1d

How far from a real-time Star Trek translator? Whisper is fast enough and light enough, LLMs are getting there, so it’s close isn’t it?

Sol-
1 replies
1d

Seems like there will always be latency, because it's not possible to easily stream over languages that have different structure. You need to wait a bit before you can start faithfully translating the meaning.

They also mention it in one of the videos about the streaming variant of their translator. But I guess the ~2s delay they mention is close enough for practical purposes.

I feel like for personal relationships where true real-time is required, having a computer intermediary would be weird anyway and you have to learn the language, at least for the time being and as long as personal relationships are still relevant (in the post-AI world they might not be).

forgot_old_user
0 replies
22h21m

You need to wait a bit before you can start faithfully translating the meaning

I guess it's possible that the AI learns about a specific person over time? That way it can be confident about what's being said as soon as the person starts saying it.

WhatsName
2 replies
1d1h

Did anyone compare this to NLLB (also Meta) yet?

trovas
0 replies
1d

In the paper, the reported results show a very similar level of quality.

jkw
0 replies
1d

We're the same team! We have some comparisons in the paper.

StrangeDoctor
2 replies
1d

Any more info about the watermarking? Only Meta can make the determination?

Edit: I can’t find the weights but if I’m reading the paper right anyone could train their own detector.

hadyelsahar
1 replies
22h44m

Hey! An RS from the Meta Seamless team here.

Yes, we chose not to release the watermark detector to safeguard against adversarial attacks. This decision helps prevent any attempts to erase the watermark by malicious users.

The watermark generator and detector are trained together. One can use the information in our paper to train your own generator and detector model; however, in this case the watermark signature created will be distinct from the one we use to protect our Seamless translation models. This approach ensures each model maintains its unique security features.
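
As a toy illustration of that joint-training idea (this is a conceptual sketch, not Meta's implementation): the generator learns to add a tiny perturbation to the audio, the detector learns to flag it, and both are optimized together so the mark stays detectable while remaining (nearly) imperceptible.

    import torch
    import torch.nn as nn

    class WatermarkGenerator(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
                                     nn.Conv1d(16, 1, 9, padding=4), nn.Tanh())

        def forward(self, audio):                  # audio: (batch, 1, samples)
            return audio + 1e-3 * self.net(audio)  # small additive mark

    class WatermarkDetector(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
                                     nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                     nn.Linear(16, 1))

        def forward(self, audio):
            return self.net(audio)                 # logit: marked vs. unmarked

    gen, det = WatermarkGenerator(), WatermarkDetector()
    opt = torch.optim.Adam(list(gen.parameters()) + list(det.parameters()), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    clean = torch.randn(8, 1, 16_000)              # stand-in for a real audio batch
    marked = gen(clean)
    loss = (bce(det(marked), torch.ones(8, 1))     # detector should fire on marked audio
            + bce(det(clean), torch.zeros(8, 1))   # ...and stay quiet on clean audio
            + 10.0 * torch.mean((marked - clean) ** 2))  # keep the mark imperceptible
    opt.zero_grad(); loss.backward(); opt.step()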

StrangeDoctor
0 replies
22h0m

Thanks for clarifying, and seems like a completely reasonable approach. Thanks for the great work.

trinovantes
1 replies
22h21m

Currently Steam bans games from using AI-generated assets (for good reason). I wonder if they'll backtrack on this or carve out exceptions, because this tech seems really useful for indie devs to add voice work to their otherwise silent games.

yjftsjthsd-h
0 replies
20h34m

Very speculative amateur opinion: My understanding is that Valve didn't exactly ban AI; they banned AI that was fed copyrighted works that could possibly make the results copyright infringement ( https://www.theverge.com/2023/7/1/23781339/valve-steam-ai-ar... ). (Side note: Regardless of individual views on whether AIs are just copyright regurgitators or not, I can understand Valve being cautious until courts have actually decided.) So if speech models can be made purely from assets that their creators can prove they have the rights to use, it would probably be easy enough to get it approved.

stephc_int13
1 replies
23h7m

As a French native speaker, I am surprised by the low quality (frankly ridiculous) voice of the French translation example.

Especially because the head of AI at Meta is a French guy AFAIK (Yann LeCun).

sangnoir
0 replies
22h2m

They are optimizing for speed (low latency)

nextworddev
1 replies
23h19m

RIP ElevenLabs?

Hakkin
0 replies
18h8m

I tried to do Japanese -> English for multiple audio snippets using the Seamless huggingface demo and all of them output complete gibberish. Really makes me question how many of the languages they claim to "support" are actually usable. ElevenLabs at least produces a result that resembles the input, so they still have the edge in some places.

mightytravels
1 replies
20h52m

I like how easy it is to get going, but you need to download about 20GB, and S2ST needs 40GB of GPU RAM!

It runs, but for any audio input I tried (you will need to provide wav, not mp3s; I tried 20s/40s/300s clips) I get just one short sentence back in the target language that seems not related at all to my audio input (e.g. "Tous les humains sont créés égaux").

Seems like some default text but it runs on full GPU for 10 minutes. Tons of bug reports in GitHub as well.

Text translation works, but I'm not sure what the context length of the model is. It seems short at first glance (I haven't looked into it).

Oh, and why is Whisper a dependency? It seems unneeded if FB has their own model.

mortimerp9
0 replies
5h36m

Hello, I work on seamless.

It runs, but for any audio input I tried (you will need to provide wav, not mp3s; I tried 20s/40s/300s clips) I get just one short sentence back in the target language that seems not related at all to my audio input (e.g. "Tous les humains sont créés égaux").

You might want to open an issue on GitHub for that one. The model is made to work on short utterances; if you have a long speech, you'll want to segment it first. I've tried "tous les humains sont créés égaux" on the demo: https://seamless.metademolab.com/expressive (which runs the same code as in the repo) and the output was correct. Maybe there is something wrong going on in the conversion of the input audio?
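
For reference, a rough sketch of that kind of preprocessing (not taken from the Seamless repo): downmix to mono, resample to 16 kHz, and split a long recording into short chunks before sending each one to the model. A real pipeline would split on silence (e.g. with a VAD) rather than at fixed offsets, and the file name below is a placeholder.

    import torchaudio

    CHUNK_SECONDS = 15
    TARGET_SR = 16_000

    waveform, sr = torchaudio.load("long_recording.wav")       # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)              # downmix to mono
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)

    chunk_len = CHUNK_SECONDS * TARGET_SR
    chunks = [waveform[:, i:i + chunk_len]
              for i in range(0, waveform.shape[1], chunk_len)]
    # Each chunk can now be translated individually.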

Oh, and why is Whisper a dependency? It seems unneeded if FB has their own model.

Whisper is a dependency as it's used as a baseline for evaluation. You can check out the paper for explanations.

kapp_in_life
1 replies
1d

Neat. How translatable are tones of voice for intent across languages? For example, does a person doing a "nerdy" voice (nasally, whiny, etc.) in English get translated to the "nerdy" stereotype for a French speaker? It seems to do very well on whispers, which made me wonder what could be next.

jeffbee
0 replies
1d

If you don't speak the language into which these models translate your inputs, how do you know if or why the model has generated, without being commanded to do so, a campy American gay male sociolect, or an African American regional accent, or some other thing that may convey unintended meaning to native listeners?

iFire
1 replies
22h54m

LICENSE

Attribution-NonCommercial 4.0 International

https://github.com/facebookresearch/seamless_communication/b...

iFire
0 replies
22h53m

Took me 2 minutes to find the Github.

gagabity
1 replies
1d

I had pretty terrible results when I tried English -> Swahili using the Hugging Face M4T V2 spaces; it pretty much doesn't work most of the time, and I just get English back with a different voice. Expressive, on the other hand, only has a few languages, it seems.

It would be nice if they could lay out what exactly is missing in terms of data to make a language work better; while the actual AI bit is out of reach for most of us, maybe we could provide more data.

There is also a 60-second limit, and I wonder if this is a Hugging Face limitation or Seamless's.

yorwba
0 replies
1d

maybe we could provide more data.

If you want to contribute by recording yourself speaking Swahili, https://commonvoice.mozilla.org/sw is the place to go. Although Meta has access to much larger data sets, they nonetheless use Common Voice as a "known good" source. E.g. the paper on their SONAR speech encoder reports experiments on Common Voice data, coincidentally involving Swahili https://ai.meta.com/research/publications/sonar-sentence-lev...

apwell23
1 replies
1d

.

jvolkman
0 replies
1d

The Google Translate app has a conversation mode.

Jayakumark
1 replies
1d1h

How does this compare to whisper-large-v3 on STT?

trovas
0 replies
1d

I work on seamless. You can see the results in the paper. M4Tv2 is significantly ahead (Whisper Large v3: 16.9 BLEU vs. M4Tv2: 26.6). These are averages over 81 directions, X->English.

xnx
0 replies
21h34m

This tech from Google seems similar, but doesn't have a fancy demo: https://blog.research.google/2023/12/unsupervised-speech-to-...

troseph
0 replies
23h47m

I feel like naming something "seamless" is not dissimilar to calling the Titanic unsinkable.

tambourine_man
0 replies
23h31m

Every video on this page is a bit out of sync with the audio. Combined with the blandness of the facial expressions and the whole mood in general, I kept waiting for the moment when the video would disclose that everything in it was created by AI.

sargun
0 replies
6h36m

It’s funny, all the humanities types try to push the proliferation of languages, but the engineering types keep trying to reduce the language barrier.

rammer
0 replies
19h53m

Marketing has been heavily involved in this page...there's at least one coloured person for every white photo..

quickthrower2
0 replies
20h3m

How did that page get camera access without my permission?

Edit: by the upvote I guess it wasn't just me?

polygamous_bat
0 replies
1d

    "The Babel fish is small, yellow, leech-like, and probably the oddest thing in the Universe. It feeds on brainwave energy received not from its own carrier, but from those around it. It absorbs all unconscious mental frequencies from this brainwave energy to nourish itself with. It then excretes into the mind of its carrier a telepathic matrix formed by combining the conscious thought frequencies with nerve signals picked up from the speech centres of the brain which has supplied them. The practical upshot of all this is that if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language. The speech patterns you actually hear decode the brainwave matrix which has been fed into your mind by your Babel fish.
    "Now it is such a bizarrely improbable coincidence that something so mind-bogglingly useful could have evolved purely by chance that some thinkers have chosen to see it as a final and clinching proof of the non-existence of God.

    "The argument goes something like this: 'I refuse to prove that I exist,' says God, 'for proof denies faith, and without faith, I am nothing.' 'But, says Man, the Babel fish is a dead giveaway, isn't it? It could not have evolved by chance. It proves you exist, and, by your own arguments, you don't. QED.' 'Oh dear,' says God, 'I hadn't thought of that,' and vanishes in a puff of logic."

pnut
0 replies
1d

I was hoping to find out that the actor's voice in the demo video was generated, or that he had recorded the video speaking in another language or something.

That would have been the knockout punch.

novok
0 replies
20h40m

I wonder how well this will perform for automatic comics translation. Current local models are pretty bad.

m3kw9
0 replies
13h44m

make a .llamafile and we'll use it.

kaycebasques
0 replies
1d1h

Besides the obvious good news about making it easier for people to communicate with each other across languages, it's also exciting to me that we're trending towards a world where I can tap into all the knowledge that only exists on the non-English web. I'm sure there are vast troves of programming knowledge in the Japanese-only web for example. The Chinese-only and Russian-only web are obvious candidates too but presumably those are harder to access for other reasons.

jwineinger
0 replies
21h7m

Any ideas on what kind of hardware this would require to run S2ST?

gorbypark
0 replies
21h39m

I've been trying (and mostly failing) to set up a pipeline to get system audio into Whisper and feed that transcription into a Seamless M4T text-to-text translation model. It seems like Seamless Streaming is going to solve most of my issues, and should significantly reduce latency!

My ultimate goal is to have realtime translations of video conferences. I've moved to a new country, and while I'm super privileged that most of my colleagues speak English, we still have a number of "all hands" meetings that I get lost in pretty easily.
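
For what it's worth, a minimal sketch of that pipeline (assuming openai-whisper and a transformers build with SeamlessM4Tv2 support are installed; the file name and language codes are placeholders):

    import whisper
    from transformers import AutoProcessor, SeamlessM4Tv2ForTextToText

    asr = whisper.load_model("medium")
    processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
    translator = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")

    # 1. Transcribe the captured system audio with Whisper.
    transcript = asr.transcribe("meeting.wav")["text"]

    # 2. Translate the transcript text-to-text with Seamless M4T v2.
    inputs = processor(text=transcript, src_lang="nld", return_tensors="pt")
    tokens = translator.generate(**inputs, tgt_lang="eng")
    print(processor.batch_decode(tokens, skip_special_tokens=True)[0])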

gloyoyo
0 replies
21h2m

This is so world changing! Exactly how I wanted to speak so confidently!

Thank you Meta!

gagabity
0 replies
21h47m

Can this do speech-to-text English -> English? I get strange results if I do a translation to the same language. It would be an interesting alternative to Whisper if it could.

denton-scratch
0 replies
55m

I don't see how realtime voice translation can ever be possible; to properly translate the first half of my sentence, you need to hear the whole sentence first. I don't know how simultaneous translators can translate from verb-at-the-end languages like German, until they know the verb.

It's not just where the verb is; sometimes I say something ambiguous, and my next utterance is supposed to acknowledge and remedy that. But if that ambiguity doesn't exist in the target language, I don't see how a simultaneous translator can convey the ambiguity, without knowing how the next utterance is going to refer to it.

Maybe that's why human simultaneous translators often seem to stumble or backtrack. I've never met someone whose job was simultaneous translation. It must be very difficult.

I'm impressed by this effort to convey non-linguistic elements of speech in translation. It's quite an achievement, and a very ambitious goal.

Aside: I wish I knew how speakers of tonal Chinese dialects express feeling, when tonality is supposed to convey semantics. When I hear chinese speakers, I can "hear" the feeling, but I don't know how they do it - it can't just be down to emphasis. (I learned some mandarin 50 years ago, at school. I learned the tones, but they didn't teach expression; and I was never taught by a native speaker, although there were language-lab tapes.)

btbuildem
0 replies
21h59m

The near-realtime aspect of this is so promising -- we're getting closer and closer to IRL babelfish!

What I would love to see is an ability to add my own voice (yes, at the risk of deepfakes) so that the model could "speak" in any language and sound more like me, not some random voice actor it was trained on.

bsza
0 replies
23h42m

"We need access to your microphone and camera to record your voice and translate it with your expressions."

None of the videos shows any modified/lip-synced footage. There doesn't seem to be a reason for this thing to need access to my camera.

Also, using it with tape over the camera doesn't seem to work either. (Perhaps it needs to see facial expressions in order to work?)

bozhark
0 replies
18h43m

I want this as a channel in our discord.

It would allow more interaction between people who don't speak the same language.

beders
0 replies
1d

I'm thrilled to see the progress made in the last 30 years.

As a student in the mid-90s I worked on a system called Verbmobil at the German Research Center for AI; it did speech-to-speech for English, German and Japanese in a very limited domain.

This was done via "classical" NLP: You had to model the domain with concepts, you needed sentence parsers, semantic engines, speech-to-text hand-crafted for 3 languages etc.

As it turns out, this approach is/was a dead-end.

asylteltine
0 replies
18h44m

It really sucks that a company so irresponsible with all your data is one of the leading AI companies now.

TheCaptain4815
0 replies
1d

The demo is so much fun to use. I can't wait for all these technologies to start integrating into filmmaking / games.

Reubend
0 replies
23h13m

Wow, after trying out the demo, I'm floored by how high quality this is. The translations worked perfectly, the voice cloning was "good enough", and the emotions conveyed in my voice were retained pretty accurately.

I don't think this would fool anyone that I was a real native speaker of the target language, but for casual conversation this would work pretty much perfectly. It basically avoids all of the traditional pitfalls of machine translation, like the unnatural robotic voice that it outputs, the slow translation speed and huge latency for realtime conversation, and the loss of emotion.

MagicMoonlight
0 replies
20h39m

> Automatically filters out toxic speech
> Watermarking

So it can't be trusted at all then

I_am_tiberius
0 replies
21h44m

I hope all these AI products will have privacy-focused alternatives more quickly than happened with Web 2.0.

Havoc
0 replies
23h39m

Can this also do straight TTS, or is it translation only? It isn't quite clear to me from the site.