
OpenVoice: Versatile Instant Voice Cloning

jamespattn
30 replies
1d

Can someone give me a practical use case where this adds a net benefit to society?

Lerc
9 replies
1d

Unifying a voice in tutorial videos so that the difference in voice does not distract the learner.

Auto non-toxic rephrasing of online chat in video games, let people hear their voice but paraphrase what they said in a manner that doesn't turn the platform into a cesspit.

Cloning your own voice so that you can turn a script into audio without 50 takes and then having to remove a million Ums and errs.

paradox460
4 replies
1d

Real time translation in the speaker's own voice.

abetusk
2 replies
23h22m

This is an exceptional use case!

Mr Beast talked about translating his videos to other languages to get more reach. This can be done for people with limited budget or just in general so people can watch videos without needing subtitles.

I wouldn't be surprised if we saw this incorporated into YT in the near future.

paradox460
1 replies
18h22m

Another really good one would be for RPGs. Instead of clumsy approaches like "Hey Dragonborn" and whatnot, they could actually say your character's name out loud.

abetusk
0 replies
18h2m

Right, and taking it one step further, LLMs in games with voice actors providing the basis for dynamic dialogue that sounds like it's coming from a person.

treetalker
0 replies
23h53m

While listening to the examples given, I noted the cross-language ones. I’m eager to improve my accents in my nonnative languages by cloning my voice and comparing recordings of how I do sound with how I would sound as a native speaker!

grayhatter
3 replies
1d

Auto non-toxic rephrasing of online chat in video games, let people hear their voice but paraphrase what they said in a manner that doesn't turn the platform into a cesspit.

that feels very orwellian

ben_w
2 replies
23h49m

George Orwell — 'If you want a picture of the future, imagine a boot stamping on a human face—for ever.'

I think this is closer to the direction of Huxley in Brave New World, where a deeper understanding of how to manipulate without brute force creates a very different dystopian society than 1984.

haroldp
1 replies
21h40m

"Don’t you see that the whole aim of Newspeak is to narrow the range of thought? In the end we shall make thoughtcrime literally impossible, because there will be no words in which to express it."

ben_w
0 replies
21h2m

Censorship by itself doesn't stop people thinking (or even expressing) forbidden thoughts, it stops a person's words reaching other people.

BNW had a similar effect by conditioning, rather than by applying the strong form of the Sapir–Whorf hypothesis.

grayhatter
6 replies
1d

No.

The real answer is yes, I could probably come up with some contrived examples, like I lost my voice in a freak LLM accident and now want to clone my old voice. But this doesn't (you don't?) really need a net benefit reason to figure it out and publish it. Because why? I assume, because "this shouldn't exist!" which is just a more palatable way to phrase "won't someone think of the children".

Society doesn't benefit from ignorance, so given it can exist, what's the problem with it existing? Why does it need a practical reason? Because people will do bad things with it? Duh, but I'd rather everyone know than just the bad guys.

johnnyworker
2 replies
23h45m

Why does it need a practical reason?

To at least give us something as a consolation for all the havoc all sorts of deep fakes will wreak on societies. It's like asking what a knife can be used for other than murder. It's a valid question.

grayhatter
1 replies
17h44m

It's a valid question, but not a good one. The implication is something needs a reason to exist. This is a new technology, and just like all new technologies, the fear of it is spreading faster than the understanding.

Scams aren't going away. Will this make it easier to scam some people? Absolutely, so did the internet. I'm not claiming this is anything like the internet. My argument is closer to: the reason people get scammed isn't because [thing exists], it's because bad people lie, and kind people trust them. We can all wring our hands in fear over what the new technology might do, or we can solve the problems we care about. Authenticity was hard before this, and it'll be hard after.

johnnyworker
0 replies
9h6m

But we live in a concrete society, [and] with concrete social and historical circumstances and political realities in this society, it is perfectly obvious that when something like a computer is invented, then it is going to be adopted for military purposes. It follows from the concrete realities in which we live, it does not follow from pure logic. But we're not living in an abstract society, we're living in the society in which we in fact live.

If you look at the enormous fruits of human genius that mankind has developed in the last 50 years, atomic energy and rocketry and flying to the moon and coherent light, and it goes on and on and on -- and then it turns out that every one of these triumphs is used primarily in military terms. So it is not reasonable for a scientist or technologist to insist that he or she does not know -- or cannot know -- how it is going to be used.

-- Joseph Weizenbaum

That is not fear. That is being serious and unflinching, if anything.

We can all wring our hands in fear over what the new technology might do, or we can solve the problems we care about.

I'm doing neither. I said it's a valid question, with which you agree. The rest is a straw man apropos of nothing anyone actually said here, and wringing your hands about it. It's a way bigger waste of time than asking a simple question and letting those who want to answer it do so, while those who don't want to answer it simply don't, instead of making up this "issue" with the question itself.

Authenticity was hard before this, and it'll be hard after.

So "nothing changes", but technology is super important? You could say the same about, say, curing cancer. People will live for a while and then die, with or without it. Since it makes no difference, what'd be the problem with "fearing" it?

lbrunson
1 replies
20h25m

By this logic there shouldn’t be regulation on anything, because the bad guys will have it any way.

While you can’t make it go away, you can disincentivize propagation and use which can be the difference between thousands of cases of scams/extortions and millions. Until there’s a stronger argument for voice cloning models (talking to a dead loved one is creepy and not a positive argument) then we shouldn’t encourage tools with overwhelmingly nefarious utility.

grayhatter
0 replies
17h52m

That's correct, I believe it shouldn't be illegal to know anything. Nor do I think science needs any kind of regulation.

Hurting people, lying, that's already illegal.

I think maybe you misunderstood my argument. My argument isn't that a good guy with a voice cloner is the only thing that can stop a bad guy with a voice cloner. That's, as you pointed out, stupid. My argument is that no one benefits if how easy it is to make one remains a secret to everyone but the bad guys.

jamespattn
0 replies
21h10m

My question wasn't to imply that I don't think a given technology should or shouldn't exist.

I was curious to see if anyone could name off the top of their head some practical use cases that they feel net out the potential harms of cloning and misusing someone else's voice.

There are some nice and certainly practical examples, but I don't feel any of them would net out the harms.

Perhaps there's a use case that we can't even comprehend yet that would though!

stale2002
2 replies
1d

Well we could just look at the obvious and existing use cases for text to speech stuff.

Alexa, Siri, and similar are all commonplace.

Another huge use case would be anything to do with voice acting. Either in video games, cartoons, or the like.

This would completely democratize voice acting material, and would empower anyone to be able to do this for cheap.

mattlondon
1 replies
20h55m

... and put 99% of voice actors out of business. We'll eventually end up with every TV show, movie, and video game being voiced by Ryan Gosling and Beyonce because market research.

stale2002
0 replies
10h51m

Technology that makes things better, faster, and more democratized does tend to harm those who profit off of things being expensive and gatekept, yes.

Democratization will always be the enemy of those who profit from preventing others from being empowered.

userbinator
0 replies
19h46m

You would be able to translate media into the language of your choice while retaining the original voices.

shinycode
0 replies
1d

Aside from the fact that it will be easier to scam people, I fail to see benefits. We can already translate everything with the same synthetic voice.

nickpsecurity
0 replies
1d

My pastor has an injured vocal cord that makes him sound gritty at times. A technology like this applied to old copies of his speaking might make him sound like he used to. I don’t know if he’d use something like that since we mostly rely on the Spirit of Christ to open hearts to the truth.

Outside public speakers, there are probably other people who’ve lost their voice or have trouble vocalizing who might want to sound like their old selves. This could help them.

Disclaimer: I think these techs will more often do damage than good. I’m just brainstorming an answer to your question.

ldoughty
0 replies
1d

From an indie game dev standpoint, I can probably say a sentence or two in a given way using my standard headset microphone.. and something like this would allow for clean voice lines fairly easily, as long as they don't need to stress too much emotion... But for a $0 game, that would still be beneficial. Imagine all the 2D Zelda/FF like games that don't get played today because people would rather listen to dialogue than read.

Of course, there's also the preservation of the voice of a loved one. I would probably pay to hear my father's voice again but there's probably only one or two VHS tapes with his voice on them.

kushie
0 replies
1d

Apple has Personal Voice for accessibility

goodluckchuck
0 replies
1d

Possibly speech therapy.

Certainly entertainment. Movies / TV. It opens a new opportunity for videogames with generative characters.

dqv
0 replies
23h56m

If you've ever done voice prompt recordings for a phone system, voice cloning would be super helpful for doing one off tweaks, especially if you have to record a bunch. Instead of rerecording 20 messages, which can sometimes take hours, you can use a clone of your own voice to make the necessary modifications. My friend does a lot of recordings as part of his job and when I showed him the Adobe voice editing preview he got really excited. It has the potential to make tweaks a lot easier, less time consuming, and reduce voice strain.

diggan
0 replies
1d

Person A used to be able to speak, but lost their voice in an accident/because of reason Y. Luckily, there is surviving audio/video with their voice on it, so a text-to-voice with their own voice could be created for them to use.

cchance
0 replies
18h54m

Being able to handle translations live, hearing the person's voice translated as if they were speaking to you in your native language with their own voice, is a big one.

abetusk
0 replies
1d

James Earl Jones, presumably hedging against his eventual demise, has allowed his voice to be used for things like the Star Wars franchise [0].

Small, independent film makers can now use a skeleton crew to voice parts.

I can't imagine it would be anything other than a niche service, but hearing the voice of, and potentially interacting with a chatbot/LLM with the voice of, a passed loved one.

This is off the top of my head. I would also guess that this technology is a stepping stone for other weird, interesting and profoundly helpful uses.

[0] https://www.theverge.com/2022/9/24/23370097/darth-vader-jame...

colesantiago
20 replies
1d2h

This repository is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which prohibits commercial usage. MyShell reserves the ability to detect whether an audio is generated by OpenVoice, no matter whether the watermark is added or not.

So it is not 'open' then and you cannot make money out of this?

DandyDev
15 replies
1d1h

It is open, just not by your definition. You can view, use and modify the code to your heart's content. Sounds pretty open to me!

cjbprime
3 replies
23h30m

And not by opensource.org's definition, which prohibits use restrictions. It's not reasonable to act like OP is being idiosyncratic when this fails to meet the protected definition of "open source".

gpm
1 replies
23h27m

The term "open source" is not protected, the OSI (opensource.org) attempted and failed to acquire a trademark on that term.

cjbprime
0 replies
22h52m

Fair enough. Is there any shared definition of "open source" which permits use restrictions, then?

DandyDev
0 replies
11h18m

I admit not to have read the whole paper, but in the intro nowhere do they mention “open source”, so it seems unfair to measure them by that definition

bbor
3 replies
1d

Well… “use” isn’t exactly free, thus the complaint. On a scale of free to not free, “cannot use this for my work” is a pretty big jump to the latter end IMO

c0pium
2 replies
1d

Careful, you’re saying the quiet part out loud; freedom is about profiting off the uncompensated work of others.

bbor
1 replies
23h33m

Well ultimately we all need to eat. If someone wants to be compensated in today’s society, they either need to join a gift-based sub-society (see: OSS foundations, NGOs in general) or sell something. Trust me, I totally agree that freedom of information should be a completely separate concern from resource allocation

EDIT: I guess there’s a third option, “work another job and use OSS on your off hours”. Which feels… idk, disrespectful of the whole enterprise. OSS software development is important enough to deserve a wage IMO, to say the least

satvikpendem
0 replies
22h35m

To your last point, people pay what the market will bear. In this case, it's free, so don't be surprised that if you give something away for free that people, well, take it for free. Importance has nothing to do with it.

abetusk
3 replies
23h26m

By the commonly held definition of open, in the context of "open source", it is not open.

You can view, use and modify the code to your heart's content.

The non-commercial clause of their license specifically prohibits commercial use, so we cannot use this source, and presumably the data that the source uses, to our heart's content.

The OSI has a definition of open source that clearly states commercial use is required [0].

Wikipedia's entry on Open Source Licensing also stipulates that commercial re-use is required [1].

There is a term called "source available" which is more in line with your intent.

[0] https://opensource.org/osd/

[1] https://en.wikipedia.org/wiki/Open-source_license

DandyDev
1 replies
11h17m

Where do they claim to be “open source”?

abetusk
0 replies
3h32m

In their README in the GitHub repository as well as the paper. I opened an issue [0] and it looks like they've updated their README, at least, to reflect that it's not open source.

[0] https://github.com/myshell-ai/OpenVoice/issues/16

nl
0 replies
16h40m

commonly held definition of open, in the context of "open source", it is not open

While this is very true, the context of "open source" can't be assumed.

jahewson
0 replies
1d1h

Open for business!

No wait…

beardog
0 replies
1d

As long as your heart's content isn't commercial

CaptainFever
0 replies
1d1h

To be specific, while it is not a bad license, it does not qualify for the Free Cultural Works mark as defined by the Creative Commons and Freedom Defined: https://creativecommons.org/public-domain/freeworks/

throwup238
3 replies
1d1h

You can’t. Scammers who don’t care about noncommercial licenses sure can!

evanmoran
1 replies
1d

This is the most insightful take. Licenses like this prevent certain businesses in certain countries, but it is quite harmful as it adds a powerful tool for propaganda/scammers/etc who don’t care about the laws.

Additionally, it only really hurts small businesses & startups, as the big companies all have teams that can make their own version or can easily pay for 3rd party APIs. So yeah, us startup folks won’t like this license much as it basically is aimed at us the most.

Either way, congrats with the tech. It does look very impressive!

cyanydeez
0 replies
1d

erm, its existence provides for scammers.

unless you're proposing its use in detecting itself is somehow symmetrical, which I really don't think is anything but unproven conjecture.

pclmulqdq
0 replies
1d

Yep, this is one of those "only bad actors" licenses, probably as a cash grab.

It will definitely stop those bad actors from scamming people this time, right? Right?

huqedato
16 replies
1d1h

Welcome to the new era of fakes and scams beyond our wildest imagination !

danielbln
10 replies
1d1h

Elevenlabs has been around for a while now. Genie has been out of the bottle for a bit, and the sooner the notion that anything digital can be easily faked seeps into the wider consciousness the better. Trust nothing.

smt88
4 replies
1d

the notion that anything digital can be easily faked seeps into the wider consciousness the better. Trust nothing.

This is a society-destroying idea.

Most of us, especially younger people, only know how to vote, where there are wars, or even what our parents are doing by using digital media.

If digital media becomes untrustworthy, everyone will live in a warped and fragile alternate reality that no one can agree on.

diggan
3 replies
1d

Trust nothing

This is a society-destroying idea.

Believe it or not, this is how much of the population saw The Internet when it first came close to being mainstream. Everyone and their mother said "Don't believe anything you read on the cybernet", which ended up ironic as everyone and their mother ended up being the ones to believe anything on the cybernet anyways.

everyone will live in a warped and fragile alternate reality that no one can agree on.

How is this any different from today? The various corners of the internet (which is mostly divided by languages: English, Russian, Spanish, Chinese and Portuguese) already have these vastly different realities and ground-truths.

I'm sure we could survive another Internet-Winter where people trust everything a bit less than today.

smt88
2 replies
22h54m

It's vastly different than today because today (or at least a few years ago), I could trust videos and voices delivered digitally. I can't do that anymore.

thomashop
0 replies
8h52m

How long has society had voice and video delivered digitally? We managed to survive fine before we had it.

If it now becomes impossible to trust a voice received through the internet without being connected to a verified telephone number I don't know how that can be classified as society-changing.

ilikehurdles
0 replies
21h2m

Technology and society will adapt, just as we adapted encryption to verify credentials and secure banking data online, we'll end up with a validation signal for video and audio.
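To make the "validation signal" idea concrete, here is a minimal sketch of tagging an audio clip so a recipient can check it was not altered. It uses a shared-secret HMAC from the Python standard library purely as a stand-in; a real scheme would more likely use public-key signatures, and the function names `sign_audio` and `verify_audio` are hypothetical, not part of any existing standard.

```python
import hashlib
import hmac


def sign_audio(audio: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the raw audio bytes.

    Assumption: sender and receiver already share `key` out of band.
    """
    return hmac.new(key, audio, hashlib.sha256).hexdigest()


def verify_audio(audio: bytes, key: bytes, tag: str) -> bool:
    """Check that `tag` matches the audio, using a constant-time comparison."""
    expected = sign_audio(audio, key)
    return hmac.compare_digest(expected, tag)
```

This only proves the clip came from someone holding the key and was not modified afterwards; it says nothing about whether the voice in the clip is genuine.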

ignu
3 replies
1d

I've seen some prank calls (a YouTuber cloned Tucker Carlson's voice and called Alex Jones) but he just had a sound bank with a few pre-generated lines and it fell apart pretty quickly.

At least for now there's too much lag to do a real time conversation with a cloned voice.

Speech to Text > LLM Response > Generate Audio

If that time can shrink to subsecond, I think there'll be madness. (Specifically thinking of romance scammers)
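The lag argument above is a latency budget over three stages. A rough sketch of how one might measure it follows; the three stage functions are placeholders invented for illustration (they do no real recognition, generation, or synthesis), and the one-second threshold is just the "subsecond" figure from the comment.

```python
import time
from typing import List, Tuple


def speech_to_text(audio: bytes) -> str:
    # Placeholder stage: a real system would run a speech recognizer here.
    return "hello"


def llm_response(text: str) -> str:
    # Placeholder stage: a real system would query a language model here.
    return f"reply to: {text}"


def generate_audio(text: str) -> bytes:
    # Placeholder stage: a real system would synthesize speech here.
    return text.encode("utf-8")


def run_pipeline(audio: bytes) -> Tuple[bytes, List[Tuple[str, float]]]:
    """Run the three stages in order and record each stage's wall-clock time."""
    stages = (
        ("speech_to_text", speech_to_text),
        ("llm_response", llm_response),
        ("generate_audio", generate_audio),
    )
    timings: List[Tuple[str, float]] = []
    value = audio
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)
        timings.append((name, time.perf_counter() - start))
    return value, timings


def is_subsecond(timings: List[Tuple[str, float]]) -> bool:
    """True when the summed stage latencies stay under one second."""
    return sum(seconds for _, seconds in timings) < 1.0
```

Because the stages run one after another, the total delay is the sum of all three, so every stage has to be fast for a conversation to feel live.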

whywhywhywhy
0 replies
6h1m

You don't need an LLM Response

shinycode
0 replies
1d

Awful, bots on their own having real conversations with people with the voice of a loved one. Scamming on steroids

ben_w
0 replies
23h44m

At last summer's WeAreDevelopers World Congress in Berlin, one of the talks I went to was by someone who did this with their own voice, to better respond to (really long?) WhatsApp messages they kept getting.

It worked a bit too well, as it could parse the sound file and generate a complete response faster than real-time, leading people to ask if he'd actually listened to the messages they sent him.

Also they had trouble believing him when he told them how he'd done it.

ethanbond
0 replies
1d1h

It can both be true that people need to adapt/“trust nothing,” and that this is bad.

underlines
1 replies
1d1h

This era is hardly new. Look at how old some of the projects are:

https://github.com/underlines/awesome-ml/blob/master/audio-a...

The thing that changes is the complexity to run it. I was training my wife's voice and my voice for fun and needed 15min of audio and trained on my 3080 for 40 minutes.

Now it's 2 Minutes.

thfuran
0 replies
1d

Yes, and the more accessible it is, the more widespread it will be.

ponector
1 replies
23h28m

Maybe this will teach people to raise their awareness and take personal security seriously. Like not trusting anyone who is calling, especially from a legacy line. Phone numbers and voices can easily be cloned.

dijksterhuis
0 replies
16h25m

The problem is that it's not just personal security, and it requires significant expertise / research to identify.

https://www.bbc.co.uk/news/world-africa-66987869

treprinum
0 replies
1d1h

VALL-E has been on GitHub for over a year already...

peddling-brink
15 replies
1d2h

GitHub: https://github.com/myshell-ai/OpenVoice Checkpoint: hxxps://myshell-public-repo-hosting.s3.amazonaws.com/checkpoints_1226.zip

(Checkpoint link defanged because I’m allergic to direct links to zip files hosted on Amazon. Nor have I reviewed what the file contains.)
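For anyone unfamiliar with the term, "defanging" here just means rewriting the scheme so the link is no longer clickable. A minimal sketch of the hxxp convention used above, with hypothetical helper names `defang` and `refang`:

```python
def defang(url: str) -> str:
    """Make a URL non-clickable by rewriting its scheme (https -> hxxps)."""
    if url.startswith("https://"):
        return "hxxps://" + url[len("https://"):]
    if url.startswith("http://"):
        return "hxxp://" + url[len("http://"):]
    return url  # Unknown scheme: leave the text untouched.


def refang(url: str) -> str:
    """Reverse defang() so the reader can deliberately restore the link."""
    if url.startswith("hxxps://"):
        return "https://" + url[len("hxxps://"):]
    if url.startswith("hxxp://"):
        return "http://" + url[len("hxxp://"):]
    return url
```

The transformation is trivially reversible by design; it only adds a deliberate step before the download.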

crazysim
11 replies
1d1h

Thanks for the link to the repo. It's very useful.

As for the checkpoint, I'm not allergic and I don't do security theater:

https://github.com/myshell-ai/OpenVoice?tab=readme-ov-file#i... links to

https://myshell-public-repo-hosting.s3.amazonaws.com/checkpo...

peddling-brink
10 replies
1d

Why do you call that security theater? I found and provided the information, but didn’t make it clickable. Anyone can decide for themselves to navigate there.

Your comment comes off as passive aggressive.

IshKebab
8 replies
22h19m

I think he's referring to your "defanging" which you implied was security related but doesn't actually achieve anything at all.

fieldcny
7 replies
21h31m

They are making you think about what you are doing before you click the link. That’s not theatre, that’s keeping people from clicking arbitrary links to zip files which can auto-execute code once downloaded.

I’d suggest that those who think it is theatre probably don’t understand the implications of that action.

IshKebab
4 replies
19h58m

We understand exactly the implications of that action. There are no implications.

Simply downloading a zip from Amazon has zero risk. Even opening an arbitrary zip has essentially zero risk. RCE from opening a zip is obviously a really critical and valuable vulnerability and would not be wasted with a public link.

Combine that with the fact that this comes from a voice cloning GitHub repo and the chance of this having some 0-day are infinitesimal.

Finally, just making the link non-clickable does not add security. Nobody can take any action to increase their security because they have to slightly edit a link (not that they would, because it's sensibly a clickable link in the GitHub readme).

So yes, I fully understand the implications and it is definitely security theatre.

I suggest that those who think that it isn't probably haven't really thought about the threat model.

peddling-brink
3 replies
18h6m

I'll be honest. You've put way more thought into this than I did.

But in the spirit of hacker news, I'll continue the argument.

There are no implications.

Untrue and absolutist.

Simply downloading a zip from Amazon has zero risk.

Agreed.

Even opening an arbitrary zip has essentially zero risk. RCE from opening a zip is obviously a really critical and valuable vulnerability and would not be wasted with a public link.

Broadly agreed. History is full of unzip vulns, but I agree that this doesn't seem likely. Much easier to persuade folks to deliberately run their malware by using the latest fad as a hook. I'm not claiming that happened here.

Combine that with the fact that this comes from a voice cloning GitHub repo and the chance of this having some 0-day are infinitesimal.

Maybe you know these authors and this repo and trust them. I don't. I'm sure they are lovely, I have no idea, I've done no research, and I've never heard of them before. That being said, if I wanted to distribute a backdoor or cryptominer to a bunch of people with powerful computers, I'd definitely hop on the AI bandwagon.

Finally just making the link non-clickable does not add security.

I disagree. Some of the commenters here are rather savvy and will properly evaluate what they are downloading. Some are not. Making a link unclickable will prevent a percentage of people from downloading. If shenanigans are discovered, someone will make a very loud comment warning folks to avoid the download. In that case some of those non-downloaders may have been saved from themselves.

Again, this wasn't a well thought out decision, but it was also a rather low impact decision, and I stand by it.

booleandilemma
1 replies
17h31m

People like the parent routinely download all of the random zip files off the web that they can get their mouse cursors on. Nothing is going to stop them.

IshKebab
0 replies
11h18m

Yep. I don't worry about non-existent threats. Nothing is going to stop me because there is no risk. Have you ever been owned by downloading a zip? Me neither.

IshKebab
0 replies
11h16m

That being said, if I wanted to distribute a backdoor or cryptominer to a bunch of people with powerful computers, I'd definitely hop on the AI bandwagon.

And write an entire novel research paper and open source the code and put it on GitHub? No you wouldn't. Don't be ridiculous.

seabass-labrax
0 replies
21h0m

On which operating systems can Zip files automatically self-execute? Android .APKs come to mind, although in this case, Android asks you whether you want to install the application and thus gives you a chance to prevent the execution.

arccy
0 replies
21h21m

just downloading a zip file won't auto execute anything. and you can't meaningfully review it without downloading it, so it pretty much is security theatre
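Once the archive is downloaded, its entry list can at least be read without extracting or running anything. A small sketch using Python's standard `zipfile` module; the function name `list_suspicious_entries` is made up for illustration, and the check only covers absolute paths and `..` components, not every trick an archive can play.

```python
import io
import zipfile
from typing import List


def list_suspicious_entries(zip_bytes: bytes) -> List[str]:
    """Return entry names that point outside the extraction directory.

    Only the archive's directory listing is read; nothing is extracted.
    """
    suspicious: List[str] = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            parts = name.replace("\\", "/").split("/")
            if name.startswith("/") or ".." in parts:
                suspicious.append(name)
    return suspicious
```

An empty result does not mean the contents are safe to run; it only rules out one class of path-traversal entries.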

janalsncm
0 replies
21h28m

What is the threat vector of the functional https link that hxxps solves?

dotancohen
2 replies
23h29m

What does allergic mean in this context?

peddling-brink
1 replies
23h4m

That file could contain anything. I don't know the authors or have any idea of their reputation.

I wanted to expose it so people didn't have to comb through the github, but decided to make it unclickable out of an abundance of caution. This appears to have offended people.

I would not have hesitated to link to hugging face. That is a known quantity.

chrisweekly
0 replies
21h22m

FWIW I appreciate the courtesy and context; agreed that it's not the best idea to link directly to zip files (let alone those of questionable provenance).

gnfargbl
15 replies
1d1h

I had to phone my bank, which is one of the bigger players in the UK high street market, a couple of days ago. They're still encouraging me to enroll in their idiotic "my voice is my password" programme. At this stage in the evolution of AI, that feels simply negligent.

toss1
13 replies
1d1h

Fidelity Investments just did something even worse ~a week ago - It asked me to reply to a few questions, then announced that I'd just been enrolled in its voice identification program (or whatever they call it).

Now I've got Just Another Item on my ToDo list, to get that undone. Gawd, does every company promote its stupidest people to management?

ben_w
9 replies
23h47m

I don't know if GDPR (or any of its cousins) applies to you, but this kind of thing sounds exactly like the sort of thing it's supposed to outlaw.

jokethrowaway
8 replies
23h23m

How? Your bank stores personal data covered by GDPR, but enabling crappy security systems is not the domain of GDPR.

Most likely this is caused by SCA, another European directive that ruined our lives with extra security hoops (for payment providers) for little extra security - or even worse security in the case of voice passwords or security questions

toss1
3 replies
16h25m

Sadly, the US, where I currently live, is quite behind in this. Considering going expat (not for this specific reason, but it doesn't help); any expats have recommendations of what countries have worked well for them?

jokethrowaway
1 replies
7h33m

Heavily depends on money available / crime level tolerance - albeit crime level in the US is pretty crazy.

Europe is deteriorating incredibly rapidly in terms of crime (due to a combination of economic poverty and uncontrolled immigration from third world countries) - but I think some of the low tax EU countries (Malta, Cyprus, Gibraltar, etc) are a good bet for a few more years.

My top choice if I had family (or friends I want to be close with) in the US would be Cayman.

I think long term, either South America drops the level of crime considerably and becomes the new place to be or China starts building futuristic cities attracting wealthy western talent to offset their declining population rate.

ben_w
0 replies
6h58m

Gibraltar

That's a British Overseas Territory, it isn't in the EU.

ben_w
0 replies
6h48m

I recommend you first narrow it down to somewhere whose main language you can speak. I picked Germany because I already had some experience with the language and slightly Dunning Kruger'd myself. I like it, but… well, even native German speakers say „Deutsche Sprache, schwere Sprache“ ("German is hard").

Cyprus has a lot of English speakers (and indeed a lot of street furniture that looks just like the UK, plus two UK airforce bases[0]), but the national language is Greek… I don't know if I'd risk that, given the one time I tried to ask for «Ένα σάντουιτς και ένα τσάι παρακαλώ»[1] in Athens[2], the person behind the counter replied in English to correct my pronunciation.

[0] https://en.wikipedia.org/wiki/Akrotiri_and_Dhekelia

[1] https://translate.google.com/?sl=el&tl=en&text=Ένα%20σάντουι...

[2] I know that's not in Cyprus, but it is, as you may guess, another place where Greek is the national language.

ben_w
3 replies
22h59m

A person's voice is, I believe, personal data.

Processing personal data is generally prohibited, unless it is expressly allowed by law, or the data subject has consented to the processing

- https://gdpr-info.eu/issues/consent/

jokethrowaway
1 replies
7h30m

I see, good point!

Given that the same voice gets processed and recorded during a normal phone call to the bank, you would need to give consent just to talk on the phone (and they do have a disclaimer when you are calling in Europe).

Most likely this is buried deep in some massive EULA you accept when you open an account.

ben_w
0 replies
7h6m

All processing is supposed to require explicit and meaningfully informed consent for each separate use; one of the GDPR training lessons we get over here is basically "Bob has a bunch of customers' emails he got from the sign up process, is he allowed to use them to send adverts for a new product?" and the answer is "No, that's only allowed when the customers explicitly consented to that, you can't just use any data they happen to have given you for whatever new purpose you want".

EULAs are a bit more of a mess, as all the advice I've been given says "don't hide stuff like that" while all the websites I visit are "we're going to do this anyway because we think we can get away with it".

seabass-labrax
0 replies
19h45m

Importantly, you can also revoke consent at any time under the GDPR. Unlimited consent isn't possible, so the bank would have to make the (dubious) claim that such processing did not require permission at all.
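A toy sketch of the two rules discussed in this subthread (consent is tied to a specific purpose, and it can be withdrawn later). The class name, purpose strings, and email address are all made up for illustration; this is only a model of the reasoning, not a compliance implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRecord:
    """Hypothetical per-person record of which purposes were consented to."""
    subject: str
    purposes: set[str] = field(default_factory=set)

    def grant(self, purpose: str) -> None:
        self.purposes.add(purpose)

    def revoke(self, purpose: str) -> None:
        # Revoking something never granted is a no-op.
        self.purposes.discard(purpose)

    def allows(self, purpose: str) -> bool:
        # Consent for one purpose says nothing about any other purpose.
        return purpose in self.purposes


customer = ConsentRecord("alice@example.com")
customer.grant("account_signup")

print(customer.allows("account_signup"))   # consent was given for this
print(customer.allows("marketing_email"))  # never asked, so not allowed

customer.revoke("account_signup")
print(customer.allows("account_signup"))   # withdrawn, so no longer allowed
```

The point of keying on the purpose string is that Bob's "new product adverts" case fails the check unless that exact purpose was granted.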

throwup238
0 replies
1d

> Gawd, does every company promote its stupidest people to management?

Yes: https://en.wikipedia.org/wiki/Dilbert_principle

Ironically, this is the place where they can do the least damage.

hasty_pudding
0 replies
1d1h

They promote their best schmoozers to management.

They have so much money that competence no longer matters and bootlicking will get you much farther.

crazysim
0 replies
1d1h

Clone management's voices and post it to their social media/etc. Super undo it!

Havoc
0 replies
22h20m

Investec? Yeah thinking I need to phone them to disable mine

z991
14 replies
23h48m

I commend the authors on making this easy to try! However it doesn't work very well for me for general voice cloning. I read the first paragraph of the wikipedia page on books and had it generate the next sentence. It's obviously computer generated to my ear.

Audio sample: https://storage.googleapis.com/dalle-party/sample.mp3

Cloned voice (converted to mp3): https://storage.googleapis.com/dalle-party/output_en_default...

All I did was install the packages with pip and then run "demo_part1.ipynb" with my audio sample plugged in. Ran almost instantly on my laptop 3070 Ti / 8GB. (Also, I admit to not reading the paper, I just ran the code)

dijksterhuis
5 replies
19h58m

It's obviously computer generated to my ear.

From the README

    Disclaimer

    This is an open-source implementation that approximates the performance of the internal voice clone technology of myshell.ai. The online version in myshell.ai has better 1) audio quality, 2) voice cloning similarity, 3) speech naturalness and 4) computational efficiency.

uoaei
3 replies
17h20m

So this paper is a thinly veiled ad of myshell.ai's services?

ametrau
1 replies
15h16m

Yes. And I used myshell.ai out of interest. It’s also absolutely terrible.

dvfjsdhgfv
0 replies
1h36m

I came here just for your comment. Thank you for doing this work so the rest of us don't have to.

gmerc
0 replies
10h40m

Like 50% of arxiv. SV figured out that people read papers in 202x, not PRNewsWire and have adjusted accordingly.

3abiton
0 replies
10h6m

Not totally unexpected unfortunately. Any other OSS players on the market?

thorum
3 replies
21h40m

My experience with other tools like xtts is you really need to have a studio-quality voice sample to get the best results.

amluto
1 replies
21h29m

The most obvious problem to my ears is the syllable timing and inflection of the generated speech, and, intuitively, this doesn’t seem like a recording quality issue. It’s as if it did a mostly credible job of emulating the speaker trying to talk like a robot.

hwillis
0 replies
21h21m

The biggest trip-up is the pronunciation of "prototypically", and you had "typically" in your original. Maybe it's overfitting to a stilted proto-typically? Could try with a different, less similar sentence

nxobject
0 replies
14h31m

That might be the next big contribution – performance in perceptually catching the features of a not-so-good recording – for example, with a webcam style microphone.

pclmulqdq
2 replies
23h19m

Looking at the website and the examples, it's pretty clearly set up to make stylized anime voices.

japanman185
1 replies
22h5m

This is the driver for a lot of things. Anime. x264 was to enable better compression of weeb videos. This tech will allow fan dubs to better represent the animes in the videos.

matheusmoreira
0 replies
4h2m

Anime also drove the development of a lot of subtitling technology if I remember correctly.

fbdab103
0 replies
23h13m

Thanks for the real example. Sounded quite generated to my ear as well. Wonder if it can do any better with more source material.

anotherevan
11 replies
20h8m

Is it possible to use this (or Eleven Labs) to generate a voice model to plug into an Android phone's TTS?

I have a friend with a paralysed larynx who is often using his phone or a small laptop to type in order to communicate. I know he would love it if it was possible to take old recordings of him speaking and use that to give him back "his" voice, at least in some small measure.

refulgentis
4 replies
18h2m

Alas, no (made some contributions to TTS at G, and worked on Android).

iOS has this built in :/ which may bode well, there's no greater Google product manager than "whatever apple just shipped."

I'm doing some xplatform on device inference stuff (see FONNX on GitHub) and it'll be one of 100 items that'll stick on my mind for a while, I hope I find time and I'll try to ping you

Edit: is an Android app with a keyboard and "speak" button that does API calls to eleven labs sufficient for something worth trying?

anotherevan
3 replies
17h44m

I'll try to ping you

Thanks. Use the email address in my profile if anything eventuates.

is an Android app with a keyboard and "speak" button that does API calls to eleven labs sufficient for something worth trying?

Maybe. Obviously something with local processing would be preferred, but it might be an option when internet connectivity is good. Is there such an app?

refulgentis
2 replies
15h11m

There isn't an ElevenLabs app like that, but I think that's the most expedient method, by far. (i.e. O(days) instead of O(months))

(warning: detailed opinionated take, I suggest skimming)

Why? Local inference is hard. You need two things: the clips to voice model (which we have here, but bleeding edge), and text + voice -> speech model.

Text to voice to speech, locally, has excellent prior art for me, in the form of a Raspberry Pi-based ONNX inference library called [Piper](https://github.com/rhasspy/piper). I should just be able to copy that, about an afternoon of work! :P

Except...when these models are trained, they encode plaintext to model input using a library called eSpeak.

eSpeak is basically f(plaintext) => ints representing phonemes.
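To show only the shape of that function (this is not eSpeak's API, rule set, or phoneme inventory; the table and the name `phonemize` are hypothetical), a toy Python version of f(plaintext) => ints could look like this:

```python
# Toy stand-in for a phonemizer front end: plaintext -> list of integer IDs.
# The symbol table is invented for illustration. Real eSpeak applies
# language-specific pronunciation rules and has a far larger phoneme set.
TOY_PHONEME_IDS = {"_": 0, "h": 1, "e": 2, "l": 3, "o": 4, "w": 5, "r": 6, "d": 7}
UNKNOWN_ID = 99  # hypothetical ID for symbols missing from the toy table


def phonemize(text: str) -> list[int]:
    """Map each character to an integer ID, treating whitespace as a pause."""
    ids = []
    for ch in text.lower():
        symbol = "_" if ch.isspace() else ch
        ids.append(TOY_PHONEME_IDS.get(symbol, UNKNOWN_ID))
    return ids


print(phonemize("hello world"))  # [1, 2, 3, 3, 4, 0, 5, 4, 6, 3, 7]
```

The model only ever sees the integers, which is why the phonemizer used at inference time has to match the one used in training.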

eSpeak is a C library and written in a style I haven't seen in a while and depends on other C libraries. So I end up needing to port like 20K lines of C to Dart...or I could use WASM, but over the last year, I lost the ability to be able to reason through how to get WASM running in Dart, both native and web.

Re: ElevenLabs

I had looked into the API months ago and vaguely remembered it was _very_ complete.

I spent the last hour or two playing with it, and reconfirmed that. They have enough API surface that you could build an app that took voice recordings, created a voice, and then did POSTs / socket connection to get audio data from that voice at will.

Only issue is pricing IMHO, $0.18 for 1000 characters. :/ But this is something I feel very comfortable saying wouldn't be _that_ much work to build and open source with a "bring your own API key" type thing.

I had forgotten about Eleven Labs till your post, which made me realize there was an actually meaningful and quite moving use case for it. All of Eleven's advantages (cloning, peak quality by a mile) come into play, and the disadvantages are blunted: local voice cloning isn't there yet, and $0.18 / 1000 characters doesn't matter as much when it's interpersonal exchanges instead of long AI responses
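Back-of-envelope arithmetic for that price, as a small Python sketch. The $0.18 per 1000 characters rate is just the figure from this comment (actual pricing may differ by plan), and the function name and sample sentence are made up.

```python
PRICE_PER_1000_CHARS_USD = 0.18  # rate quoted in this comment, not an official figure


def tts_cost_usd(text: str) -> float:
    """Estimated cost of synthesizing `text` at a flat per-character rate."""
    return len(text) * PRICE_PER_1000_CHARS_USD / 1000


message = "Could you pass the salt, please?"
print(f"{len(message)} characters -> ${tts_cost_usd(message):.4f}")

# A day of conversation, assuming roughly 10,000 typed characters:
print(f"10000 characters -> ${tts_cost_usd('x' * 10_000):.2f}")
```

At that rate a single short sentence costs a fraction of a cent, which is the sense in which the price matters less for person-to-person conversation than for long generated responses.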

l-albertovich
1 replies
3h22m

Wouldn't it be better to use FFI to build an idiomatic interface to use in Dart instead?

refulgentis
0 replies
2h32m

It's a good point but I'm a perfectionist and can't abide without a web version.

though, now that I write that...

Native: FFI.

Web: Dart calling simple JS function, and the JS handles WASM.

...is an excellent sweet spot. Matches exactly what I do with FONNX. The trouble with WASM is Dart-bounded.

(n.b. re: local cloning for anyone this deep, this would allow local inference of the existing voices in the Raspberry Pi x ONNX voicer project above. It won't _necessarily_ help with doing voice cloning locally, you'll need to prove out that you can get a voice cloning model in ONNX to confirm.)

(n.b. re: translating to Dart, I think the only advantage of a pure Dart port would be memory safety stuff but I also don't think a pure Dart port is feasible without O(months) of time. The C is...very very very 2000s C. globals in one file representing current state that 3 other files need to access. array of structs formed by just reading bytes from a file at runtime that matches the struct layout)

klankbrouwerij
4 replies
18h46m

Your friend can take a look at solutions from Acapela [0], SpeakUnique [1] or VOCALiD [2]. Not sure whether they have a solution for Android though.

I recently saw a video from Google about a custom voice they created for somebody with ALS but I can't seem to find it online (Does anybody have a link?). Creating custom voices is not yet available on Android though. The latest iOS release (iOS 17) does support creating personalized voices.

ModelTalker [3] is a long-term (research?) project to create custom voices for people with speech disabilities. Their TTS seem to support Android so that might be another option.

[0] https://www.acapela-group.com/ [1] https://www.speakunique.co.uk/ [2] https://vocalid.ai/ [3] https://www.modeltalker.org/

anotherevan
3 replies
17h27m

Hi and thanks for the suggestions. Looking through them, it looks like you need to do what they call "voice banking" before you lose your voice. Basically reading a script they provide.

Unfortunately my friend's voice is too far gone for that to be possible. Hoping for something where they can use old recordings to generate a voice.

jokethrowaway
0 replies
7h41m

From my tests I think Audiobox from Meta is the most promising (even better than Eleven Labs) - too bad it's closed source and they force you to read some randomly generated sentences (to prevent the case of someone generating a cloned voice without consent).

Right now Eleven Labs is your best bet.

xTTS is just not there quality wise. The version available in the studio is marginally better than the OSS version but it's still pretty far from being believable.

The non-nerfed version of Tortoise (the author decided to ruin their own project but forks exist) was decent at voice cloning but it takes a lot of tries.

I'm pretty sure we already have the technology to do what you want and help your friend, it's just a matter of time until it gets better and more software comes out.

droopyEyelids
0 replies
16h54m
cjbprime
0 replies
16h51m

Some of the recent transformer models can work with audio clips just a few seconds long. I'm sure the final output is less good, but perhaps your friend has audio clips that would work for that from e.g. home movies.

Share6323
0 replies
19h29m

That would be awesome

RagnarD
10 replies
22h7m

My first and ongoing thought is that immoral/criminal uses of voice cloning vastly exceed any legitimate ones.

squigz
3 replies
20h20m

Out of curiosity, what/how many legitimate use cases have you considered?

RagnarD
1 replies
15h45m

A legitimate use, in the abstract, is one where a particular individual is willing to have their voice used to say X. The entertainment industry - movies and games - are likely to want this.

But if it's trivial to use somebody's voice to say any arbitrary thing, then it'll be done. Combined with deepfake videos, the result will be the ability to show anyone saying anything, including lies and things they find incredibly objectionable, in a disturbingly realistic way, and more so as time wears on.

The fundamental issue is that we don't live in a rights-respecting world. Making it easy to utter anything in the voice of anyone will lead to many more abuses than legitimate instances.

EnigmaFlare
0 replies
12h32m

People will get immune to it if they aren't already. It's already common to fake screenshots of tweets/etc. Not a real problem unless you want to believe falsehoods, and then you will anyway.

bryanrasmussen
0 replies
3h44m

Potential legitimate uses I can think of -

1. licensing voice to other uses - people with recognizable trademarkable voices (actors, singers) have another potential revenue stream. yay!

2. use of past voices - voices from the past that are not 'owned' - let's say Humphrey Bogart's voice, can be used in projects without having to pay for an imitator. This would be useful for both marketing and artistic projects. But probably less for marketing because they will want to go with step 1.

3. Teach yourself to talk like X. People who need to learn to talk like a particular person / have a particular accent could learn quicker. Just think - you will be able to supplement your comedy routine with kickass Christopher Walken impersonations any day now!

Variations of 3 and 2 together open up interesting modes of aesthetic impression, but I won't go into that here. But definitely I have some ideas that might benefit from being able to do this.

CaptainFever
1 replies
17h2m

My first thought is anonymity. I can make YouTube videos without needing to use my real voice... while being able to keep my personal inflections and emphasis, something TTS (AI) voices can't do.

Or...! Indie game development. I can learn basic voice acting (to get rid of the cringe), and act out all of my characters using different voices.

throwaway17_17
0 replies
16h50m

The indie game development and animated short content are the primary uses for this type of NN for me. I’m working (not very successfully) at putting together a ‘style transfer’ solution that maps a single source voice to many result voices, using standard PyTorch components. Realistically I can pay for the target sample voice to record some amount of varied vocal performance, and then, if the net is trained specifically on my voice as the source, the hope is that the transfer can capture the ‘performance’ qualities in my original.

And in case anyone is concerned, I intend to make the purpose of the vocal samples clear to the provider and then arrange appropriate credit and compensation to those whose voices I used. I also don’t intend to train with anything but public domain and purchased data.

tacocataco
0 replies
3h6m

Talk with your loved ones and make a passphrase for if you're stuck in an emergency and need money wired or something.

Some banks have voice authentication when you call in and you have to ask to opt out.

jokethrowaway
0 replies
7h28m

I disagree, we should just not accept voice as authentication.

I think the most common use case will be making art & content programmatically without voice actors (and most likely without actors at all once we nail video or a 3d model pipeline + frame by frame transformation to make it look realistic)

graphe
0 replies
20h9m

What of commercial uses being greater than illegitimate ones? YouTube will give people the ability to hear it in their own localized language in the author's voice.

airstrike
0 replies
20h30m

Which just means we need to build protocols around this risk, rather than foolishly trying to shove the genie back in the bottle, lest we be left with only the criminal uses

iAkashPaul
5 replies
1d1h

That watermark detection rights at the end is real sus

diggan
3 replies
1d1h

What exactly are you talking about? The paper doesn't mention any watermark at all, as far as I can see/search.

cwillu
2 replies
1d1h

The readme on the linked github reads: “MyShell reserves the ability to detect whether an audio is generated by OpenVoice, no matter whether the watermark is added or not.”

lostlogin
1 replies
1d1h

As you say, right at the bottom https://github.com/myshell-ai/OpenVoice

diggan
0 replies
1d

Ah, thank you. Guess that's OK that the company/service do whatever they want, the paper/technique doesn't involve watermarks, so it'd be easy to remove/modify whatever they do in the library/service itself.

fbdab103
0 replies
23h8m

At least right now, there is a literal add_watermark function, so probably easy enough to remove that surface level. Unless they added something cute to the training data to poison the well.

https://github.com/myshell-ai/OpenVoice/blob/a33963c3d764bee...

senthilnayagam
3 replies
1d

current leader in open source voice cloning is RVC, would like to see how it compares to it.

echelon
2 replies
22h32m

RVC is voice conversion (audio to audio), and it's typically finetuned.

This is zero shot TTS. Samples create vector encodings that serve as input to inference. There's no retraining the model unless you want it to generalize or perform better.
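A toy sketch of the "samples become vector encodings" idea, assuming mean pooling over made-up per-frame features. Real systems use a trained neural encoder; the function names and numbers here are hypothetical and only show that a new voice changes an input vector, not the model weights.

```python
import math


def speaker_embedding(frames):
    """Mean-pool per-frame feature vectors into one fixed-size vector."""
    count = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Invented 3-dimensional "features" for two short reference clips.
clip_a = [[0.9, 0.1, 0.0], [1.1, -0.1, 0.0]]
clip_b = [[0.0, 1.0, 0.2], [0.0, 1.0, -0.2]]

emb_a = speaker_embedding(clip_a)
emb_b = speaker_embedding(clip_b)
print(emb_a, emb_b, cosine_similarity(emb_a, emb_b))
```

At inference time a vector like `emb_a` would be passed alongside the text, so cloning a new speaker means computing a new vector rather than retraining.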

cchance
1 replies
18h56m

It isn't, though. People need to read the paper and the comments from the author: they aren't actually doing the voice generation. They pass the text off to VITS, and their sauce is that they do tone mapping on that VITS output. So if anything they're a competitor to RVC; it's just that the version they published also includes VITS.

echelon
0 replies
17h32m

Interesting.

Funny enough, a lot of RVC packages are using VITS to do RVC for TTS.

qwertox
2 replies
1d

And suddenly it becomes a bit weird:

https://docs.myshell.ai/tokenomics

Tokenomics

Disclaimer: MyShell is currently in the testing phase, and the content of the whitepaper may be subject to change in the future.

$SHELL is the token used for user incentive, governance and in-app utility.

The total supply of $SHELL is 1,000,000,000

diggan
1 replies
1d

And luckily, this submission seems to be about the paper/technology OpenVoice, not about the company MyShell (whatever that is).

qwertox
0 replies
1d

License[0]: This repository is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which prohibits commercial usage. MyShell reserves the ability to detect whether an audio is generated by OpenVoice, no matter whether the watermark is added or not.

[0] https://github.com/myshell-ai/OpenVoice

andrewstuart
2 replies
16h3m

If I wanted to do voice cloning on my own hardware can anyone suggest what a good open source project would be to use? What is the state of the art in open source voice cloning?

snerbles
1 replies
15h19m

I use Tortoise TTS. It's slow, a little clunky, and sometimes the output gets downright weird. But it's the best quality-oriented TTS I've found that I can run locally.

It's allegedly the basis of the tech used by Eleven Labs.

https://github.com/neonbjb/tortoise-tts

xsdu
0 replies
9h33m

There are faster implementations of tortoise that allow fine-tuning. You can get close to ElevenLabs quality if you have a perfect dataset. https://git.ecker.tech/mrq/ai-voice-cloning

whycome
1 replies
12h8m

It's not really well advertised and I'm not sure Apple is continuing development, but iOS has a voice clone feature called "Personal Voice" - it takes about 15 mins to train it with your own voice (and then takes a few hours to process on-device when locked). You can use it in phone calls and FaceTime (maybe other places?). It would be nice to use it for general TTS.

sagz
0 replies
10h30m

It's an accessibility feature for people losing their voice or on the verge of. And it is TTS only, not speech-to-speech as your mention of "can use it in phone calls and facetime" implies. Not being s2s means it doesn't retain vocal disfluencies, prosody etc signals that make a voice feel real

monkeydust
1 replies
23h38m

So I guess we could (legally) now create a voice chatbot using Mickey Mouse audio from Steamboat Willie?

andylynch
0 replies
23h30m

Possibly, except there is no dialogue in it.

dcreater
1 replies
1d2h

Any GitHub link?

saeedesmaili
0 replies
1d2h
chipper02
1 replies
16h44m

Now if only YouTube would ban the use of this crap. Or at the very least allow you to filter those videos.

whywhywhywhy
0 replies
6h2m

There's genuine uses, look at Apple offering this tech recently as an accessibility feature for people losing the ability to speak to have text to speech in their own voice in lieu of being able to vocalize it themselves.

You're banning genuine uses like that or just creators who want to fix a fumbled or awkward line without completely re-recording if you ban it.

yboris
0 replies
31m
windex
0 replies
11h22m

Fraud becomes easier I guess.

tremarley
0 replies
23h38m

Their Tokenomics page says

$SHELL is the token used for user incentive, governance and in-app utility.

The total supply of $SHELL is 1,000,000,000

Team, Treasury, Advisors & Private Sale = 55% allocation

Community Incentive = 40% allocation

Liquidity = 5%
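Turning those percentages into token counts, as a quick Python sketch. The supply and the three percentages are the ones quoted above; the variable and function names are made up.

```python
TOTAL_SUPPLY = 1_000_000_000  # total $SHELL supply quoted above

ALLOCATION_PERCENT = {
    "Team, Treasury, Advisors & Private Sale": 55,
    "Community Incentive": 40,
    "Liquidity": 5,
}


def allocation_in_tokens(total, percents):
    """Convert whole-number percentages into absolute token amounts."""
    return {name: total * pct // 100 for name, pct in percents.items()}


tokens = allocation_in_tokens(TOTAL_SUPPLY, ALLOCATION_PERCENT)

# The three buckets should account for the whole supply.
assert sum(ALLOCATION_PERCENT.values()) == 100

for name, amount in tokens.items():
    print(f"{name}: {amount:,} $SHELL")
```

So more than half of the supply, 550,000,000 tokens, sits in the team, treasury, advisors and private sale bucket.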

thimkerbell
0 replies
19h13m

Overall, will this be a good thing or a bad thing for society, do you think?

If it is a bad thing, should we cheer it on?

starwin1159
0 replies
23h4m

I hope someone can handle Cantonese one day

smellf
0 replies
1d2h

Examples: https://research.myshell.ai/open-voice

Seems impressive!

programjames
0 replies
1d

I love this paper. It reads very much like "this is what we did, and we want to help others do it too." Also, the section "Remark on Novelty" is golden: "OpenVoice does not intend to invent the submodules in the model structure ... The contribution of OpenVoice is the decoupled framework that seperates the voice style and language control from the tone color cloning." They don't try to hype up their contribution.

pclmulqdq
0 replies
1d

Wonderful company, not a scam at all: https://docs.myshell.ai/tokenomics

kennethologist
0 replies
3h27m

Does anyone know, at a deeply practical and technical level, how 11Labs achieves the level they do?

ijhuygft776
0 replies
20h9m

Is there some similar software that allows you to add, let's say, 40 years to a voice?

hasty_pudding
0 replies
1d1h

Holy cow! If this works without curated audio...this is amazing!

cwillu
0 replies
1d1h

From the github readme:

“MyShell reserves the ability to detect whether an audio is generated by OpenVoice, no matter whether the watermark is added or not.”

Call me skeptical…

SubiculumCode
0 replies
1d

What's with the crypto thing?