StyleTTS2 – open-source Eleven-Labs-quality Text To Speech

modeless
45 replies
18h45m

I made a 100% local voice chatbot using StyleTTS2 and other open source pieces (Whisper and OpenHermes2-Mistral-7B). It responds so much faster than ChatGPT. You can have a real conversation with it instead of the stilted Siri-style interaction you have with other voice assistants. Fun to play with!

Anyone who has a Windows gaming PC with a 12 GB Nvidia GPU (tested on 3060 12GB) can install and converse with StyleTTS2 with one click, no fiddling with Python or CUDA needed: https://apps.microsoft.com/detail/9NC624PBFGB7

The demo is janky in various ways (requires headphones, runs as a console app, etc.), but it's a sneak peek at what will soon be possible to run on a normal gaming PC just by putting together open source pieces. The models are improving rapidly; there are already several improved models I haven't yet incorporated.

lucubratory
20 replies
18h40m

How hard does the task of making the chatbot converse naturally look from your end? Specifically I'm thinking about interruptions: if it's talking too long I'd like to be able to start talking and interrupt it like in a normal conversation, or if I'm saying something it could quickly interject. Once you've got the extremely high speed, theoretically faster than real time, you can start doing that stuff, right?

There is another thing remaining after that for fully natural conversation, which is making the AI context aware like a human would be. Basically giving it eyes so it can see your face and judge body language to know if it's talking too long and needs to be more brief, the same way a human talks.

modeless
19 replies
18h31m

Yes, I implemented the ability to interrupt the chatbot while it is talking. It wasn't too hard, although it does require you to wear headphones so the bot doesn't hear itself and get interrupted.

The other way around (bot interrupting the user) is hard. Currently the bot starts processing a response after every word that the voice recognition outputs, to reduce latency. When new words come in before the response is ready it starts over. If it finishes its response before any more words arrive (~1 second usually) it starts speaking. This is not ideal because the user might not be done speaking, of course. If the user continues speaking the bot will stop and listen. But deciding when the user is done speaking, or if the bot should interrupt before the user is done, is a hard problem. It could possibly be done zero-shot using prompting of a LLM but you'd want a GPT-4 level LLM to do a good job and GPT-4 is too slow for instant response right now. A better idea would be to train a dedicated turn-taking model that directly predicts who should speak next in conversations. I haven't thought much about how to source a dataset and train a model for that yet.
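
In pseudocode-ish form, that loop looks roughly like the sketch below (all names are placeholders, not the actual app's code; the real generation presumably runs concurrently and cancels stale drafts rather than blocking as shown):

  # Rough sketch of the restart-on-new-words loop described above; placeholder
  # names only, and real generation would run concurrently, not block like this.
  import queue
  import time

  words = queue.Queue()      # filled by the speech-recognition thread
  SILENCE_GAP = 1.0          # seconds without new words before the bot speaks

  def generate_reply(text: str) -> str:
      return f"(reply to: {text})"   # placeholder: call the local LLM here

  def speak(text: str) -> None:
      print("BOT:", text)            # placeholder: send text to the TTS engine

  def respond_loop() -> None:
      transcript, pending_reply, last_word_time = [], None, None
      while True:
          try:
              transcript.append(words.get(timeout=0.05))
              last_word_time = time.time()
              # new words arrived: the old draft is stale, start over
              pending_reply = generate_reply(" ".join(transcript))
          except queue.Empty:
              pass
          if (pending_reply is not None and last_word_time is not None
                  and time.time() - last_word_time > SILENCE_GAP):
              speak(pending_reply)   # user seems done; speak the finished draft
              transcript, pending_reply = [], None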

Ultimately the end state of this type of system is a complete end-to-end audio-to-audio language model. There should be only one model; it should take audio directly as input and produce audio directly as output. I believe that having TTS and voice recognition and language modeling all as separate systems will not get us to 100% natural human conversation. I think that such a system would be within reach of today's hardware too; all you need is the right training dataset/procedure and some architecture bits to make it efficient.

As for giving the model eyes, there are actually already open source vision-language models that could be used for this today! I'd love to implement one in my chatbot. It probably wouldn't have the social intelligence to read body language yet, but it could definitely answer questions about things you present to the webcam, read text, maybe even look at your computer screen and have conversations about what's on your screen. The latter could potentially be very useful; the endgame there is like GitHub Copilot for everything you do on your computer, not just typing code.

fintechie
4 replies
17h16m

although it does require you to wear headphones so the bot doesn't hear itself and get interrupted.

Maybe you can use some sort of speaker identification to sort this out?

https://github.com/openai/whisper/discussions/264

woodson
2 replies
12h5m

A simple correlation of audio chunks from microphone and from the TTS should be enough to tell which parts in the input stream are re-recorded TTS. Much simpler, no?
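
Something like this rough numpy/scipy sketch (the window handling and the threshold are made up, and as the reply below notes, real rooms add unknown delay and filtering):

  # Sketch: flag a mic chunk as "probably the bot's own voice" if it correlates
  # strongly with audio the TTS played recently. Threshold is arbitrary.
  import numpy as np
  from scipy.signal import correlate

  def looks_like_playback(mic_chunk: np.ndarray, tts_recent: np.ndarray) -> bool:
      """Both arguments are mono float32 audio at the same sample rate."""
      if len(tts_recent) < len(mic_chunk) or not np.any(mic_chunk):
          return False
      # slide the mic chunk over the last few seconds of TTS output
      corr = correlate(tts_recent, mic_chunk, mode="valid", method="fft")
      # normalized cross-correlation so loudness doesn't matter
      window_energy = np.convolve(tts_recent ** 2, np.ones(len(mic_chunk)), mode="valid")
      denom = np.linalg.norm(mic_chunk) * np.sqrt(np.maximum(window_energy, 1e-12))
      peak = np.max(np.abs(corr) / denom)
      return peak > 0.5  # arbitrary threshold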

modeless
1 replies
11h52m

It's not so simple when the impulse responses of the room and mic and speakers are all unknown, possibly changing, plus unknown background sounds as well, possibly at a very high level. And there's also unknown latency, which can be quite large especially in the networked case, and maybe some codecs, and maybe some audio "enhancement" software the OEM installed on the user's machine. Also, ideally the computer would be able to hear the user even while it is speaking.

Echo cancellation is non-trivial for sure.

IanCal
0 replies
5h50m

Could it be done more reasonably with the transcription? With diarisation the logic of "someone is saying exactly / almost exactly what I'm saying, that's probably me" might be pretty reasonable.
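
A rough sketch of that check using difflib from the standard library (the 0.8 threshold is just a guess):

  # If what Whisper just heard is almost a verbatim copy of what the bot just
  # said, assume it is the bot hearing itself.
  from difflib import SequenceMatcher

  def is_probably_self_echo(heard: str, bot_recently_said: str, threshold: float = 0.8) -> bool:
      heard, said = heard.lower().strip(), bot_recently_said.lower().strip()
      if not heard or not said:
          return False
      # fuzzy ratio is robust to a few misrecognized words
      return SequenceMatcher(None, heard, said).ratio() >= threshold

  # is_probably_self_echo("so what have you been up to lately",
  #                       "So what have you been up to lately?")  -> True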

modeless
0 replies
15h20m

Yes, this is a good idea. Too many good ideas, not enough time!

joshspankit
3 replies
15h33m

Short-term could it be configured as push to talk?

modeless
2 replies
15h31m

Certainly, but then it has little advantage over e.g. ChatGPT voice mode. I guess running locally is an advantage, but the voice and answer quality are worse. The much better latency and more natural conversation are what I like about it.

joshspankit
1 replies
13h59m

Am I wrong to think it would have a couple of major advantages? Like using speakers without having to worry about echo cancellation, having a distinct interrupt signal, and still getting all the latency benefits (possibly even more once you get used to it, since the conversational style has to assume the end of the user's sentence instead of knowing the second they let go of the button).

modeless
0 replies
13h30m

You wouldn't quite have all the latency benefits, because you'd have the additional delay between when you stop speaking and when you release the button (or cut off your speech if you release too early). It wouldn't respond any faster because it's already responding at the fastest possible speed right now; it doesn't wait at all. And it wouldn't be hands-free or feel like a natural conversation, which is what I'm going for.

I'd rather use speaker diarization and/or echo cancellation to solve the problem without needing the user to press any buttons.

globalnode
3 replies
17h30m

You'd have to do something along the lines of what voice comms software does to combat the output feedback problem. I think it involves an FFT to analyse the two signals and cancel out the feedback; I'm not 100% sure on the details.

modeless
2 replies
15h37m

I plan to change the audio input to use WebRTC, then I get echo cancellation and network transparency for free. Although dealing with WebRTC is a bigger headache than doing the AI parts.

Sean-Der
1 replies
15h28m

What do you find hard about WebRTC?

I would love to help. Would even code up a prototype if you wanted :)

modeless
0 replies
15h19m

For starters, every WebRTC demo I've tried has at least 400ms of round trip latency even on a loopback connection. Shoot me an email if you know WebRTC, would be good to chat with someone who knows stuff!

taneq
1 replies
14h46m

Deciding when someone is done speaking is hard to do well and impossible to do perfectly. Some people finish speaking, then think of something else to say and pretend they were still talking.

modeless
0 replies
14h43m

True, perfection isn't achievable, but human-level performance is all you need, and it may be possible to do better than that.

slow_numbnut
1 replies
9h21m

Instead of sacrificing flexibility by building one monolithic model that does audio-to-audio in one go, wouldn't it be better to train a model that handles conversing with the user (knows when the user is done talking, when it's hearing itself, etc.) and leave the thinking to other, more generic models?

modeless
0 replies
9h17m

You don't lose flexibility with an end-to-end model. You lose controllability. But there are ways to mitigate that.

lucubratory
0 replies
17h20m

Thanks, fascinating insights. I think an everything-to-everything multimodal model could work if it's big enough because of transfer learning (but then there are latency issues), and so could a refined system built on LLMs/LMMs with TTS (like what you are using), but I haven't seen any good research on audio-to-audio language models. My suspicion is that that would take a lot of compute, much more than text, and that the amount of semantically meaningful accessible data might be much lower as well. And if you do manage to get to the same level of quality as text, what is latency like then? Not 100% sure, just intuitions, but I doubt it's great.

I like the idea of an RL predictor for interruption timing, although I think it might struggle with factual-correction interruptions. It could be a good way to make a very fast system, and if latency on the rest of the system is low enough you could probably start slipping in your "Of course", "Yeah, I agree", and "It was in March, but yeah" for truly natural speech. If latency is low you could just use the RL system to find opportunities to interrupt, give them to the LLM/LMM, and it decides how to interrupt, all the way from "mhm", to "Yep, sounds good to me", to "Not quite, it was the 3rd entry, but yeah otherwise it makes sense", to "Actually can I quickly jump on that? I just wanted to quickly [make a point]/[ask a question] about [some thing that requires exploration before the conversation continues]".

Tuning a system like this would be the most annoying activity in human history, but something like this has to be achieved for truly natural conversation so we gotta do it lol.

generalizations
0 replies
2h37m

It could possibly be done zero-shot using prompting of a LLM

That's how I've been thinking of doing it - seemed like you could use a much smaller GPT-J-ish model for that, and measure the relative probability of 'yes' vs 'no' tokens in response to a question like 'is the user done talking'. Seemed like even that would be orders of magnitude better than just waiting for silence.
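
A rough sketch of that with Hugging Face transformers; the model name below is just a stand-in for whatever small GPT-J-ish model you'd actually pick, and the prompt wording would need tuning:

  # Compare the probability of " Yes" vs " No" as the next token after a
  # turn-taking question. Model and prompt are placeholders.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")
  lm = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m")

  def user_seems_done(transcript_so_far: str) -> bool:
      prompt = (f'Partial transcript: "{transcript_so_far}"\n'
                "Has the speaker finished what they were saying? Answer Yes or No.\n"
                "Answer:")
      ids = tok(prompt, return_tensors="pt")
      with torch.no_grad():
          next_token_logits = lm(**ids).logits[0, -1]
      yes_id = tok.encode(" Yes", add_special_tokens=False)[0]
      no_id = tok.encode(" No", add_special_tokens=False)[0]
      # relative probability of "Yes" vs "No" as the next token
      return next_token_logits[yes_id] > next_token_logits[no_id]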

eigenvalue
5 replies
16h56m

Tried it, but it seems it only works with CUDA 11 and I have 12 installed. Not really willing to potentially screw up my CUDA environment to try it.

modeless
3 replies
15h48m

Thanks for trying, what error message did you get? It works without CUDA installed at all on my test machine.

eigenvalue
2 replies
15h7m

  Process Process-2:
  Traceback (most recent call last):
    File "multiprocessing\process.py", line 314, in _bootstrap
    File "multiprocessing\process.py", line 108, in run
    File "chirp.py", line 126, in whisper_process
    File "chirp.py", line 126, in <listcomp>
    File "faster_whisper\transcribe.py", line 426, in generate_segments
    File "faster_whisper\transcribe.py", line 610, in encode
  RuntimeError: Library cublas64_11.dll is not found or cannot be loaded
  tts initialized

modeless
1 replies
14h53m

Hmm, the DLL is included in the app package, but maybe there is a conflict with other installed DLLs on some machines. When releasing PC software I always expect this type of issue, unfortunately. I plan to move away from faster_whisper, which may fix this.

I have to say that the Python ecosystem is just awful for distribution purposes and I spent a lot longer on packaging issues than I did on the actual AI parts. And clearly didn't find all of the issues :)

eigenvalue
0 replies
14h37m

Agree completely. But in this case the fault is with CUDA, which never ever works without a struggle. It’s insane how much work it takes to get stuff using CUDA to work cross-platform. Even PyTorch has an awkward way of dealing with it, and they have more resources to figure it out than just about anyone.

nmstoker
0 replies
3h30m

Using a conda environment should be able to get around that, I believe.

tomp
4 replies
17h41m

How do you get Whisper to be fast?

Isn't it quite non-realtime?

TOMDM
1 replies
17h33m

The community upgrades to Whisper are far faster than real-time, especially if you have a powerful GPU.

Jach
0 replies
4h33m

There are several faster ones out there. I've been using https://github.com/Softcatala/whisper-ctranslate2 which includes a nice --live_transcribe flag. It's not as good as running it on a complete file, but it's been helpful to get the gist of foreign-language live streams.

wahnfrieden
0 replies
17h25m

Use distil-whisper, it's like 5-8x faster.

modeless
0 replies
15h44m

Great question! Whisper processes audio in 30-second chunks. But on a fast GPU it can finish in only 100 milliseconds or so. So you can run it 10+ times per second and get around 100ms latency. Even better actually, because Whisper will sometimes predict past the end of the audio.

This is an advantage of running locally. Running whisper this way is inefficient but I have a whole GPU sitting there dedicated to one user, so it's not a problem as long as it is fast enough. It wouldn't work well for a cloud service trying to optimize GPU use. But there are other ways of doing real time speech recognition that could be used there.
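
For the curious, a minimal sketch of that polling approach with faster-whisper (the library the traceback upthread shows in use); the model size, options, and buffer handling here are assumptions, not the app's actual code:

  # audio_buffer is assumed to be 16 kHz mono float32 captured so far.
  import numpy as np
  from faster_whisper import WhisperModel

  model = WhisperModel("small.en", device="cuda", compute_type="float16")

  def transcribe_buffer(audio_buffer: np.ndarray) -> str:
      # Whisper always sees a 30-second window; one pass is ~100 ms on a fast
      # GPU, so calling this ~10x per second keeps latency around 100 ms.
      segments, _info = model.transcribe(audio_buffer, language="en", beam_size=1)
      return " ".join(seg.text.strip() for seg in segments)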

funtech
4 replies
14h15m

Is 12GB the minimum? Got an out-of-memory error with 8GB.

modeless
3 replies
14h12m

Yes, unfortunately these models take a lot of VRAM. It may be possible to do an 8GB version but it will have to compromise on quality of voice recognition and the language model so it might not be a good experience.

joshspankit
2 replies
13h56m

This might be silly because of how few people it benefits, but could it be broken up onto multiple 8GB cards on the same system?

modeless
1 replies
13h33m

Yes, it absolutely could. You're right that this configuration is rare. Although people have been putting together machines with multiple 24GB cards in order to split and run larger models like llama2-70B.

wahnfrieden
0 replies
10h45m

The latest large models are 120B with 100k context, such as Goliath and Tess XL.

xena
1 replies
17h11m

It threw a Python exception for me and didn't generate speech.

modeless
0 replies
15h48m

Thanks for trying, what exception did you get?

shon
1 replies
11h59m

Cool work! I tested it and got some mixed results:

1) It throws an error if it's installed to any drive other than C:\ -- I moved it to C: and it works fine.

2) I'm seeing huge latency on an EVGA 3080Ti with 12GB. I'm also seeing it repeat the parsed input; even though I only spoke once, it appears to process the same input many times, sometimes with slightly different predictions. Here are some logs:

  Latency to LLM response: 4.59
  latency to speaking: 5.31
  speaking 4: Hi Jim!
  user spoke: Hi Jim.
  user spoke recently, prompting LLM. last word time: 77.81 time: 78.11742429999867
  latency to prompting: 0.31

  Latency to LLM response: 2.09
  latency to speaking: 3.83
  speaking 5: So what have you been up to lately?
  user spoke: So what have you been up to lately?
  user spoke recently, prompting LLM. last word time: 83.9 time: 84.09415280001122
  latency to prompting: 0.19
  user spoke: So what have you been up to lately? No, I'm watching.
  user spoke a while ago, ignoring. last word time: 86.9 time: 88.92142140000942
  user spoke: So what have you been up to lately? No, just watching TV.
  user spoke a while ago, ignoring. last word time: 87.9 time: 90.76665070001036
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring. last word time: 87.9 time: 94.16581820001011
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring. last word time: 88.9 time: 97.85854300000938
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring. last word time: 87.9 time: 101.54986060000374
  user spoke: No, I just bought you a TV.
  user spoke a while ago, ignoring. last word time: 87.8 time: 104.51332219998585
  user spoke: No, I'll just watch you TV.
  user spoke a while ago, ignoring. last word time: 87.41 time: 106.60086529998807
  Latency to LLM response: 46.09
  latency to speaking: 50.49

Thanks for posting it!

Edit:

3) It's hearing itself and responding to itself...

modeless
0 replies
11h40m

Thanks for trying it and thanks for the feedback! Yes, right now you need to use headphones so it doesn't hear itself. Sometimes Whisper inexplicably fails to recognize speech promptly; it seems to depend on what you say, so try saying something else. I have improvements that I haven't had time to release yet that should help, and a lot more work is definitely needed; this is MVP-level stuff right now. It's all fixable, but it'll take time.

samsepi0l121
1 replies
16h19m

But Whisper does not support input streaming, so don't you have to wait for the whole utterance before triggering the transcription?

yencabulator
0 replies
15h35m

Apparently, by running it on windows of audio very often:

https://news.ycombinator.com/item?id=38340938

aik
1 replies
14h35m

Hey modeless. Love it. Is your project open source by any chance? Would love to see it.

modeless
0 replies
14h4m

I haven't decided yet what I'm going to do with it. I think ideally I would open source it for people who have GPUs but also run it as a paid service for people who don't have GPUs. Open source that also makes money is always the holy grail :) I'll post updates on my Twitter/X account.

mlsu
17 replies
22h27m

We're now at "free, local, AI friend that you can have conversations with on consumer hardware" territory.

- synthesize an avatar using stablediffusion

- synthesize conversation with llama

- synthesize the voice with this text thing

soon

- VR

- Video

wild times!

jpeter
12 replies
22h24m

Which consumer GPU runs Llama 70B?

mlsu
5 replies
21h55m

A Mac with a lot of unified RAM can do it, or a dual 3090/4090 setup gets you 48GB of VRAM.

benjaminwootton
2 replies
20h49m

I’ve got a 64GB Mac M2. All of the openllm models seem to hang on startup or on API calls. I got them working through GCP Colab. Not sure if it’s a configuration issue or if the hardware just isn’t up to it?

wahnfrieden
0 replies
20h23m

Try llama.cpp with Metal (critical) and GGUF models from TheBloke

Or wait another month or so for https://ChatOnMac.com

benreesman
0 replies
20h26m

Valiant et al work great on my 64GB Studio at Q4_K_M. Happy to answer questions.

jadbox
1 replies
21h32m

Does this actually work? I had thought that you can't use SLI to increase your net memory for the model?

speedgoose
0 replies
19h34m

It works. I use ollama these days, with litellm for the API compatibility, and it seems to use both 24GB GPUs on the server.

brucethemoose2
4 replies
21h53m

A single 3090, or any 24GB GPU. Just barely.

Yi 34B is a much better fit. I can cram 75K context onto 24GB without brutalizing the model with <3bpw quantization, like you have to do with 70B for 4K context.

speedgoose
3 replies
19h35m

Can it produce any meaningful outputs with such an extreme quantisation?

brucethemoose2
2 replies
18h45m

Yeah, quite good actually, especially if you quantize it on text close to what you are trying to output.

Llama 70B is a huge compromise at 2.65bpw... This does make the model much "dumber." Yi 34B is much better, as you can quantize it at ~4bpw and still have a huge context.

lossolo
1 replies
18h18m

How would you compare mistral-7b-instruct fp16 (or a similar 7B/13B model like Llama 2, etc.) to Yi-34B quantized?

brucethemoose2
0 replies
17h54m

34B is better. Quantization hurts some, especially in "pro" non-chat use cases like RAG, but the increased parameter count makes models so much smarter in comparison.

The perplexity graph here is a pretty good illustration: https://github.com/ggerganov/llama.cpp/pull/1684

YMMV, as Mistral and Yi are not necessarily comparable like different sizes of llama, and it depends on the task.

sroussey
0 replies
22h22m

Prosumer gear.

MacBook Pro M3 Max.

trafficante
0 replies
21h14m

Seems like a fun afternoon project to get this hooked into one of the Skyrim TTS mods. I previously messed around with ElevenLabs, but it had too much latency and would be somewhat expensive long-term, so I’m excited to try local and free.

I’m sure I have a lot of reading up to do first, but is it a safe assumption that I’d be better served running this on an M2 MBP rather than taxing my desktop’s poor 3070 by running it on top of Skyrim VR?

imiric
0 replies
17h20m

I'm looking forward to this tech being used in video games, as well as generative models in general. Interacting with smart NPCs will make everyone's experience different. The avatars themselves could be dynamically generated, and entire environments for that matter. Truly game changing technology for interactive entertainment.

cloudking
0 replies
21h33m

Would be great to have a local home assistant voice interface with this + llama + whisper.

Hamcha
0 replies
22h10m

Yup, and you can already mix and match both local and cloud AIs with stuff like SillyTavern/RealmPlay if you wanna try what the experience is like, people have been using it to roleplay for a while.

lhl
16 replies
22h3m

I tested StyleTTS2 last month; here are my step-by-step notes, which might be useful for people doing a local setup (not too hard): https://llm-tracker.info/books/howto-guides/page/styletts-2

Also, I did a little speed/quality shoot-out with the LJSpeech model (vs VITS and XTTS). StyleTTS2 was pretty good and very fast: https://fediverse.randomfoo.net/notice/AaOgprU715gcT5GrZ2

kelseyfrog
13 replies
21h45m

inferences at up to 15-95X (!) RT on my 4090

That's incredible!

Are infill and outpainting equivalents possible? Super-RT TTS at this level of quality opens up a diverse array of uses, especially for indie/experimental gamedev, that I'm excited for.

refulgentis
9 replies
21h6m

Not sure what you mean: If you mean could inpainting and outpainting with image models be faster, it's a "not even wrong" question, similar to asking if the United Airlines app could get faster because American Airlines did. (Yes, getting faster is an option available to ~all code.)

If you mean could you inpaint and outpaint text...yes, by inserting and deleting characters.

If you mean could you use an existing voice clip to generate speech by the same speaker in the clip, yes, part of the article is demonstrating generating speech by speakers not seen at training time.

pedrovhb
5 replies
20h49m

I'm not sure I understand what you mean to say. To me it's a reasonable question asking whether text to speech models can complete a missing part of some existing speech audio, or make it go on for longer, rather than only generating speech from scratch. I don't see a connection to your faster apps analogy.

Fwiw, I imagine this is possible, at least to some extent. I was recently playing with xtts and it can generate speaker embeddings from short periods of speech, so you could use those to provide a logical continuation to existing audio. However, I'm not sure it's possible to manage the "seams" between what is generated and what is preexisting very easily yet.

It's certainly not a misguided question to me. Perhaps you could be less curt and offer your domain knowledge to contribute to the discussion?

Edit: I see you've edited your post to be more informative, thanks for sharing more of your thoughts.

refulgentis
4 replies
19h3m

It imposes a cost on others when you make false claims, like saying I said or felt the question was unreasonable.

I didn't and don't.

It is a hard question to understand and an interesting mind-bender to answer.

Less policing of the metacontext and more focusing on the discussion at hand will help ensure there are interlocutors around to, at the very least, continue policing.

IshKebab
3 replies
18h11m

Sorry but it was pretty obvious what he meant.

refulgentis
2 replies
15h18m

It's not, at all.

He could have meant speed, text, audio, words, or phonemes, with images the least probable.

He probably didn't mean phonemes or he wouldn't be asking.

He probably didn't mean arbitrarily slicing 'real' audio and stitching on fake audio - he made repeated references to a video game.

He probably didn't mean inpainting and outpainting imagery, even though he made reference to a video game, because it's an audio model.

Thank you for explaining that I deserve to get downvoted through the floor multiple times for asking a question because it's "obvious". Maybe you can explain to the rest of the class what he meant then? If it was obviously phonemes, will you then advocate for them being downvoted through the floor since the answer was obvious? Or is it only people who assume good faith and ask what they meant who deserve downvotes?

IshKebab
1 replies
10h0m

Inpainting and outpainting of images is when the model generates bits inside or outside the image that don't exist. By analogy he was talking about generating sound inside (I.e. filling gaps) or outside (extrapolating beyond the end) the audio.

I don't know why you would think he was talking about inpainting images or words. This whole discussion is about speech synthesis.

refulgentis
0 replies
2h48m

Right, _until he brought up inpainting and outpainting_. And as I already laid out, the audio options made just about as much sense as the art.

I honestly can't believe how committed you are to explaining to me that as the only person who bothered answering, I'm the problem.

I've been in AI art since it was 10 people in an IRC room trying to figure out what to do with a bunch of GPUs an ex-hedge fund manager snapped up, and I spent the last week working on porting eSpeak, the bedrock of ~all TTS models, from C++.

It wasn't "obvious" they didn't mean art, and it definitely was not obvious that they wanted to splice real voice clips at arbitrary points and insert new words without it being a detectable fake for a video game. I needed more info to answer. I'm sorry.

kelseyfrog
2 replies
19h38m

Ignore the speed comment; it is unrelated to my question.

What I mean is, can output be conditioned on antecedent audio as well as text, analogous to how image diffusion models can condition inpainting and outpainting on static parts of an image and CLIP embeddings?

refulgentis
1 replies
19h7m

Yes, the paper and Eleven Labs have a major feature of "given $AUDIO_SET, generate speech for $TEXT in the same style of $AUDIO_SET"

No, in that you can't cut it at an arbitrary mid-word point, say at "what tim" in "what time is it beijing", and give it the string "what time is it in beijing", and have it recover seamlessly.

Yes, in that you can cut it at an arbitrary phoneme boundary, say 'this, I.S. a; good: test! ok?', which in IPA is 'ðˈɪs, ˌaɪˌɛsˈeɪ; ɡˈʊd: tˈɛst! ˌoʊkˈeɪ?', and I can cut it 'between' a phoneme, give it the rest, and have it complete.
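
For reference, that IPA is the kind of output phonemizer + espeak-ng give you (the same stack that appears in the install steps elsewhere in this thread), so you can compute the phoneme boundaries yourself; rough sketch:

  from phonemizer import phonemize

  ipa = phonemize("this, I.S. a; good: test! ok?",
                  language="en-us", backend="espeak",
                  preserve_punctuation=True, with_stress=True)
  print(ipa)  # roughly the IPA string quoted above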

kelseyfrog
0 replies
18h25m

Perfect! Thank you

huac
1 replies
19h37m

It is theoretically possible to train a model that, given some speech, attempts to continue the speech, e.g. Spectron: https://michelleramanovich.github.io/spectron/spectron/. Similarly, it is possible to train a model to edit the content, a la Voicebox: https://voicebox.metademolab.com/edit.html.

taneq
0 replies
14h42m

Great. :P

Me: Won’t it be great when AI can-

Computer: Finish your sentences for you? OMG that’s exactly what I was thinking!

JonathanFly
0 replies
17h32m

Are infill and outpainting equivalents possible?

Do you mean outpainting as in you still specify what words to generate, or the model just extends the audio unconditionally, the way some image models expand past an image's borders without a specific prompt (in audio, like https://twitter.com/jonathanfly/status/1650001584485552130)?

rahimnathwani
1 replies
12h39m

Thanks. Following the instructions now. BTW mamba is no longer recommended (for those like me who aren't already using it), and the #mambaforge anchor in the link didn't work.

lhl
0 replies
2h47m

I switched from conda to mamba a while ago and never looked back (it's probably saved dozens of hours of waiting for conda's slow-as-molasses package resolution). I'm looking at the latest docs and it doesn't look like there are any deprecation messages or anything (it does warn against installing mamba inside of conda, but that's been the case for a long time): https://mamba.readthedocs.io/en/latest/installation/mamba-in...

It looks like miniforge is still the recommended install method, but also the anchor has changed in the repo docs, which I've updated, thx. FWIW, I haven't run into any problems using mamba. I'm not a power user, so there are edge cases I might have missed, but I have over 35 mamba envs on my dev machine atm, so it's definitely been doing the job for me and remains wicked fast (if not particularly disk efficient).

eigenvalue
15 replies
20h36m

Was somewhat annoying to get everything to work as the documentation is a bit spotty, but after ~20 minutes it's all working well for me on WSL Ubuntu 22.04. Sound quality is very good, much better than other open source TTS projects I've seen. It's also SUPER fast (at least using a 4090 GPU).

Not sure it's quite up to Eleven Labs quality. But to me, what makes Eleven so cool is that they have a large library of high quality voices that are easy to choose from. I don't yet see any way with this library to get a different voice from the default female voice.

Also, the real special sauce for Eleven is the near-instant voice cloning with just a single 5-minute sample, which works shockingly (even spookily) well. Can't wait to have that all available in a fully open source project! The services that provide this as an API are just too expensive for many use cases. Even the OpenAI one, which is on the cheaper side, costs ~10 cents for a couple-thousand-word generation.

sandslides
7 replies
20h27m

The LibriTTS demo clones unseen speakers from a five-second or so clip.

eigenvalue
6 replies
20h14m

Ah ok, thanks. I tried the other demo.

eigenvalue
5 replies
19h52m

I tried it. Sounds absolutely nothing like my voice or my wife's voice. I used the same sample files as I used 2 days ago on the Eleven Labs website, and they worked flawlessly there. So this is very, very far from being close to "Eleven Labs quality" when it comes to voice cloning.

lewismenelaws
1 replies
17h59m

Yep. Tried as well. Tried a little clip of Tony Soprano and it came out as a British guy.

xTTSv2 does it much better. The quality of the trained voices is great, though.

eigenvalue
0 replies
17h45m

Yes, same for my voice. Made me sound British and didn't capture anything special about my voice that makes it recognizable.

thot_experiment
0 replies
18h46m

Ah that's disappointing, have you tried https://git.ecker.tech/mrq/ai-voice-cloning ? I've had decent results with that, but inference is quite slow.

sandslides
0 replies
19h32m

The speech generated is the best I've heard from an open source model. The one test I made didn't make an exact clone either but this is still early days. There's likely something not quite right. The cloned voice does speak without any artifacts or other weirdness that most TTS systems suffer from.

jsjmch
0 replies
18h45m

ElevenLabs is based on Tortoise-TTS, which was already pre-trained on millions of hours of data, but this one was only trained on LibriTTS, which is 500 hours at best. If you have seen millions of voices, there are definitely gonna be some of them that sound like you. It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.

wczekalski
2 replies
20h31m

One thing I've seen done for style cloning is a high-quality fine-tuned TTS -> RVC pipeline to "enhance" the output. TTS for intonation + pronunciation, RVC for voice texture. With StyleTTS and this pipeline you should get close to ElevenLabs.

eigenvalue
0 replies
19h51m

I suspect they are doing many more things to make it sound better. I certainly hope open source solutions can approach that level of quality, but so far I've been very disappointed.

KolmogorovComp
0 replies
5h47m

RVC? R… Voice Model?

wczekalski
1 replies
20h33m

Have you tested longer utterances with both ElevenLabs and with StyleTTS? Short audio synthesis is a ~solved problem in the TTS world, but things start falling apart once you want to do something like create an audiobook with text-to-speech.

wingworks
0 replies
18h37m

I can say that the paid service from ElevenLabs can do long form TTS very well. I used it for a while to convert long articles to voice to listen to later instead of reading. It works very well. I only stopped because it gets a little pricey.

eigenvalue
1 replies
18h54m

To save people some time, this is tested on Ubuntu 22.04 (Google is being annoying about the download link, saying too many people have downloaded it in the past 24 hours, but if you wait a bit it should work again):

  # clone the repo and set up an isolated Python environment
  git clone https://github.com/yl4579/StyleTTS2.git
  cd StyleTTS2
  python3 -m venv venv
  source venv/bin/activate
  python3 -m pip install --upgrade pip
  python3 -m pip install wheel
  pip install -r requirements.txt
  pip install phonemizer
  # espeak-ng provides the phonemizer backend for text-to-phoneme conversion
  sudo apt-get install -y espeak-ng
  # fetch and unpack the two pretrained model archives from Google Drive
  # (7z comes from p7zip-full if you don't already have it)
  pip install gdown
  gdown https://drive.google.com/uc?id=1K3jt1JEbtohBLUA0X75KLw36TW7U1yxq
  7z x Models.zip
  rm Models.zip
  gdown https://drive.google.com/uc?id=1jK_VV3TnGM9dkrIMsdQ_upov8FrIymr7
  7z x Models.zip
  rm Models.zip
  # extras needed by the demo notebooks
  pip install ipykernel pickleshare nltk SoundFile
  python -c "import nltk; nltk.download('punkt')"
  pip install --upgrade jupyter ipywidgets librosa
  python -m ipykernel install --user --name=venv --display-name="Python (venv)"
  jupyter notebook
  
Then navigate to /Demo and open either `Inference_LJSpeech.ipynb` or `Inference_LibriTTS.ipynb` and they should work.

degobah
0 replies
11h44m

Very helpful, thanks!

beltsazar
11 replies
20h23m

If AI renders some jobs obsolete, I suppose the first ones will be audiobook narrators and voice actors.

washadjeffmad
7 replies
19h27m

Hardly. Imagine licensing your voice to Amazon so that any customer could stream any book narrated in your likeness without you having to commit the time to record. You could still work as a custom voice artist, all with a "no clone" clause if you chose. You could profit from your performance and craft in a fraction of the time, focusing as your own agent on the management of your assets. Or, you could just keep and commit to your day job.

Just imagine hearing the final novel of ASoIaF narrated by Roy Dotrice and knowing that a royalty went to his family and estate, or if David Attenborough willed the digital likeness of his voice and its performance to the BBC for use in nature documentaries after his death.

The advent of recorded audio didn't put artists out of business; it expanded the industries that relied on them by allowing more of them to work. Film and tape didn't put artists out of business; they expanded the industries that relied on them by allowing more of them to work. Audio digitization and the internet didn't put artists out of business; they expanded the industries that relied on them by allowing more of them to work.

And TTS won't put artists out of business, but it will create yet another new market with another niche that people will have to figure out how to monetize, even though 98% of the revenues will still somehow end up with the distributors.

nikkwong
4 replies
19h9m

What you're not considering here is that a large majority of this industry is made up of no-name voice actors who have a pleasant (but perfectly substitutable) voice, which is now something that AI can do perfectly and at a fraction of the price.

Sure, celebrities and other well-known figures will have more to gain here as they can license out their voice; but the majority of voice actors won't be able to capitalize on this. So this is actually even more perverse because it again creates a system where all assets will accumulate at the top and there won't be any distributions for everyone else.

washadjeffmad
3 replies
17h28m

No, I am. I work with them, and I've been one (am one, rarely).

I listed just one possible use, but I also see voice cloning and advanced TTS expanding access for evocative instruction, as an aid to study style and expand range.

Don't be afraid on their behalf. The dooming you're talking about applied to every one of the technological changes I already listed, and we employ more performers and artists today than ever in history.

When animation went digital, we graduated more storyboard artists and digital animators. When music notation software and sampling could replace musicians and orchestras, we graduated more musicians and composers trained on those tools. Now it's the performing arts, and no one in the industry is going to shrink their pool of available talent (or risk ire) by daring to conflate authenticity and performance with virtual impersonation. Performance capture and VFX also didn't kill or consolidate the movie industry; they allowed it to expand.

Art evolves, and so does its business. People who love art want to see people who do art succeed. I'm optimistic.

nikkwong
0 replies
11h48m

I don't know, I feel like the work produced through voice acting is more of a commodity than work in the other industries that you're describing. Sure, a voice actor can add a lot of emotion and verbal nuance in a way that is differentiating, but I'm not sure the difference is enough to matter for most people in the vast majority of cases. (Or I may be too dense to realize it.) This is in contrast to, say, the performing arts, where there are, in my opinion, many more dimensions to the creative output, which makes it less perfectly substitutable.

hgomersall
0 replies
9h33m

TTS actually allows scope for far more different artists' likenesses to be incorporated. A book can be read with all the characters having an entirely different voice. This is difficult currently and relies on the skill of the performer.

beltsazar
0 replies
6h56m

When animation went digital, we graduated more storyboard artists and digital animators. When music notation software and sampling could replace musicians and orchestras, we graduated more musicians and composers trained on those tools.

What you explained is that tech has changed the tools used by artists.

It's substantially different with AI-based TTS, though. It's not a tool for artists; it's a tool for movie/game/book publishers to replace human voice actors. The AI will be much, much more scalable and cheaper.

vunderba
0 replies
12h40m

You're not really thinking it through. I have friends involved in the VA business, and it's only gotten more competitive as time has progressed. This is partially because it's rare that we need a voice actor who can create a crazy Looney Tunes-sounding voice; the majority of VA work is surprisingly close to the natural speaking voice of the VA themselves.

It's rare that you need a talent like Dan Castellaneta, Mel Blanc, etc.

Secondly, yes, VA licensing will become a thing – but that means jobs that would previously have gone to other, lesser-known voice actors (because the major players simply didn't have enough time to take those gigs) will no longer go to them. A TTS VA can do unlimited recordings.

Thirdly, major studios that would require hundreds of voices for video games and other things don't have to license known voices at all; they can just generate brand-new ones and pay zero licensing fees.

bongodongobob
0 replies
19h8m

The point is no one will pay for any of that if you can just clone someone's voice locally. Or just tell the AI how you want it to sound. Your argument literally ignores the entire elephant in the room.

riquito
2 replies
18h21m

I can see a future where the label "100% narrated by a human" (and similar in other industries) will be a thing.

fbdab103
0 replies
18h11m

A la A Young Lady's Illustrated Primer.

amelius
0 replies
16h43m

"No humans were fired in the making of this film"

progbits
10 replies
22h59m

MIT license

Before using these models, you agree to [...]

No, this is not MIT. If you don't like the MIT license then feel free to use something else, but you can't pretend this is open source and then attempt to slap additional restrictions on how the code can be used.

weego
3 replies
22h40m

I think you mis-parsed the disclaimer. It's just warning people that cloned voices come with a different set of rights than the software (because the person the voice is a clone of has rights to their voice).

chrismorgan
2 replies
22h21m

(Don’t let’s derail the conversation, please, but “disclaimer” is completely the wrong word here. This is a condition of use. A disclaimer is “this isn’t mine” or “I’m not responsible for this”. Disclaimers and disclosures are quite different things and commonly confused, but this isn’t even either of them.)

gosub100
1 replies
21h9m

This always annoys me when people put "disclaimers" on their posts. IANAL, so tired of hearing that one. It's pointless because even if you were a lawyer, you cannot meaningfully comment on a case without the details, jurisdiction, circumstances, etc. Next, it's meaningless because is anyone going to blindly bow down and obey if you state the opposite? "Yes, I AM a lawyer, you do not need to pay taxes, they are unconstitutional." Thirdly, when they "disclaimer" themselves as working at Google, that's not a dis-claimer, that's a "claimer", asserting the affirmative. I know their companies require them not to speak for the company without permission, but I hardly ever hear that one; usually it's just some useless self-disclosure that they might be biased because they work there. Ok, who isn't biased?

What bugs me overall is that it's usually vapid mimicry of a phrase they don't even understand.

nielsole
0 replies
18h26m

IANAL, but giving legal advice without being a lawyer may be illegal in some jurisdictions. Not sure if the disclaimer is effective or was ever tested in court. The disclaimer/disclosure mix-up is super annoying, but disclosing obvious biases even if not legally required seems like good practice to me.

gpm
1 replies
22h40m

As I understand it, the source code is licensed MIT, while the weights are under a "weird proprietary license that doesn't explicitly grant you any rights and implicitly probably grants you some usage rights so long as you tell the listeners or have permission from the voice you cloned".

Which, if you think the weights are copyrightable in the first place, makes them practically unusable for anything commercial/that you might get sued over, because relying on a vague implicit license is definitely not a good idea.

ronsor
0 replies
19h3m

And if you don't think weights are copyrightable, it means nothing at all.

sandslides
0 replies
22h54m

Yes, I noticed that. Doesn't seem right, does it?

pdntspa
0 replies
22h12m

As if anyone outside of corporate legal actually cares

ericra
0 replies
22h22m

This bothered me as well. I opened an issue on the repo asking them to consider updating the license file to reflect these additional requirements.

The wording they currently use suggests that this additional license requirement applies to more than just their pre-trained models.

IshKebab
0 replies
22h38m

I think that's referring to the pre-trained models, not the source code.

stevenhuang
9 replies
21h5m

I really want to try this but making the venv to install all the torch dependencies is starting to get old lol.

How are other people dealing with this? Is there an easy way to get multiple venvs to share like a common torch venv? I can do this manually but I'm wondering if there's a tool out there that does this.

eurekin
2 replies
20h27m

Same here. I'm using conda and eyeing simply installing PyTorch into the base conda env.

lhl
1 replies
18h58m

I don't think "base" works like that (while it can be a fallback for some dependencies, afaik, Python packages are isolated/not in path). But even if you could, don't do it. Different packages usually have different pytorch dependencies (often CUDA as well) and it will definitely bite you.

The biggest optimization I've found is to use mamba for everything. It's ridiculously faster than conda for package resolution. With everything cached, you're mostly just waiting for your SSD at that point.

(I suppose you could add the base env's lib path to the end of your PYTHONPATH, but that sounds like a sure way to get bitten by weird dependency/reproducibility issues down the line.)

eurekin
0 replies
17h57m

Thank you! First time I've come across it. Looks very promising.

lukasga
1 replies
20h32m

Can relate to this problem a lot. I have considered starting to use a Docker dev container and making a base image for shared dependencies, which I can then customize in a Dockerfile for each new project; not sure if there's a better alternative though.

stevenhuang
0 replies
10h32m

Yeah there is the official Nvidia container with torch+cuda pre-installed that some projects use.

I feel more projects should start with that as the base instead of pinning on whatever variants. Most aren't using specialized CUDA kernels after all.

Suppose that's the answer: just pick the specific torch+CUDA base that matches the major version of the project you want to run. Then cross your fingers and hope the dependencies mesh :p.

amelius
1 replies
16h41m

is starting to get old lol.

If it's starting to get old, then this means that an LLM like Copilot should be able to do it for you, no?

stevenhuang
0 replies
10h50m

I mean that I already have like 10 different torch venvs for different projects all with various pinned versions and CUDA variants.

Still worth the trade-off of not having to deal with dependency hell, but you start to wonder if there is a better way. Altogether this is many GBs of duplicated libs, wasted bandwidth and compute.

wczekalski
0 replies
21h1m

I use nix to set up the Python env (Python version + poetry + sometimes Python packages that are difficult to install with poetry) and use poetry for the rest.

The workflow is:

  > nix flake init -t github:dialohq/flake-templates#python
  > nix develop -c $SHELL
  > # I'm in the shell with poetry env, I have a shell hook in the nix devenv that does poetry install and poetry activate.

stavros
0 replies
18h23m

I generally try to use Docker for this stuff, but yeah, it's the main reason why I pass on these, even though I've been looking for something like this. It's just too hard to figure out the dependencies.

kats
8 replies
9h27m

This is really harmful and unethical work. It will be used to hurt millions of elderly people with scams. That's the real application that will happen 100x more than anything else. It's unethical and harmful to release tools that will be overwhelmingly used to hurt elderly people. What they should do about it: stop releasing models. Only release a service so that scammers will not use it. Also, only release audio that is watermarked, so that apps can tell that a phone call might be a scam. When they share models with researchers, use previous best practices: post a Google Form to request access.

flarg
3 replies
9h21m

Millions of elderly people are already getting scammed by overseas call centers, so unless we do something more significant this tech will not make one iota of a difference.

kats
2 replies
9h20m

That's not really true; most scammers have a male voice with a heavy accent. When they have tools that easily disguise their voice, scammers can reach many more elderly people.

slow_numbnut
1 replies
9h3m

That might have been true about a year ago, but I've been getting calls from well-spoken native-level scammers for about two months now. They are so frequent that I can put them on speaker during family gatherings to raise awareness.

A sample size of 1 is never representative, but they definitely have full access to native speakers or tech that can generate very passable speech.

maeil
0 replies
6h59m

It seems quite possible that the change you've seen in these last two months is because some have started using these models. More likely than a sudden huge shift in either the country of origin or English skills of the scammers.

slow_numbnut
1 replies
9h9m

Just imagine if this line of thinking was used elsewhere.

This tech is already out of the bag and I thank the author(s) for the contribution to humanity. The correct solution here is not to shove your head in the sand and ignore reality, but to get your government to penalize any country or company that facilitates this crime. If they can enforce severe penalties for other financial crimes and funding terrorism, they can do the same here.

kats
0 replies
8h58m

It's funny because just yesterday I posted:

soon as it's out, a whole bunch of extremely privileged ML people will throw their hands up and say, "oh well, cats out of the bag."

https://news.ycombinator.com/context?id=38324742

mx20
0 replies
3h36m

Scammers scamming old people is already very widespread, so should we maybe outlaw telephones as well? Or maybe mandate anti-scamming filters that disconnect if something is discussed that could be a scam? If I think about it, that actually would make more sense, but it would still be problematic.

127
0 replies
7h52m

Cars actually kill over a million people per year. Not saying this is good, just that all technology has its tradeoffs.

satvikpendem
7 replies
22h5m

Funnily enough, the TTS2 examples sound better than the ground truth [0]. For example, the "Then leaving the corpse within the house [...]" example has the ground truth pronounce "house" weirdly, with some change in the tonality that sounds higher, but the TTS2 version sounds more natural.

I'm excited to use this for all my ePub files, many of which don't have corresponding audiobooks, such as a lot of Japanese light novels. I am currently using Moon+ Reader on Android which has TTS but it is very robotic.

[0] https://styletts2.github.io/
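
A rough sketch of the ePub side of this, for anyone wanting to try it (ebooklib and BeautifulSoup for text extraction, nltk's punkt for sentence splitting; synthesize() is a hypothetical placeholder, not a real StyleTTS2 API):

  # Pull chapter text from an ePub, split into sentences, and hand each
  # sentence to whatever TTS call you end up with.
  import ebooklib
  from ebooklib import epub
  from bs4 import BeautifulSoup
  import nltk

  def epub_sentences(path: str):
      book = epub.read_epub(path)
      for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
          text = BeautifulSoup(item.get_content(), "html.parser").get_text(" ", strip=True)
          for sentence in nltk.sent_tokenize(text):
              yield sentence

  # for sentence in epub_sentences("novel.epub"):
  #     wav = synthesize(sentence)  # placeholder for the actual TTS call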

risho
4 replies
21h38m

How are you planning on using this with ePubs? I'm in a similar boat. Would really like to leverage something like this for ebooks.

satvikpendem
3 replies
21h34m

I wonder if you can add a TTS engine to Android as an app or plugin, then make Moon+ Reader or another reader use that custom engine. That's probably the easiest approach, but if that doesn't work, I might just have to make my own app.

a_wild_dandan
1 replies
20h25m

I’m planning on making a self-host solution where you can upload files and the host sends back the audio to play, as a first pass on this tech. I’ll open source the repo after fiddling and prototyping. I’ve needed this kinda thing for a long time!

risho
0 replies
18h28m

Please make sure to link it back to HN so that we can check it out!

jrpear
0 replies
19h55m

You can! RHVoice (https://rhvoice.org/) is an open source example.

qingcharles
0 replies
11h55m

First Wife is a professional voice-over actor. I saw someone left her a bad review saying "Clearly an AI."

2023. There is no way to win.

KolmogorovComp
0 replies
20h25m

The pace is better, but imho there is still a very noticeable "metallic" tone which makes it inferior to the real thing.

Impressive results nonetheless, and superior to all other TTS.

jasonjmcghee
6 replies
16h13m

Out of curiosity - to folks that have had success with this...

This voice cloning is... nothing like XTTSv2, let alone ElevenLabs.

It doesn't seem to care about accents at all. It does pretty well with pitch and cadence, and that's about it.

I've tried all kinds of different values for alpha, beta, embedding scale, diffusion steps.

Anyone else have better luck?

Sure it's fast and the sound quality is pretty good, but I can't get the voice cloning to work at all.

jsjmch
3 replies
13h9m

See my previous comment about this point. ElevenLabs is based on Tortoise-TTS, which was already pre-trained on millions of hours of data, but this one was only trained on LibriTTS, which is 500 hours at best. XTTS was also trained with probably millions of speakers in more than 20 languages.

If you have seen millions of voices, there are definitely gonna be some of them that sound like you. It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.

wczekalski
1 replies
10h35m

What's your basis for the claim that they are based on TorToiSe? I have seen this claim made (and rebutted) many times.

jsjmch
0 replies
8h51m

Very similar features, quite slow inference speed, and various rumors.

lossolo
0 replies
11h39m

It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.

It's really not that difficult; they are trained mostly on audiobooks and high-quality audio from YouTube videos. If we're talking about the ElevenLabs model then we are talking about around 500k hours of audio, but Tortoise-TTS is only around 50k from what I remember.

dsrtslnd23
0 replies
15h27m

See the concluding remarks in the paper - they acknowledge that voice cloning is not that good (yet).

carbocation
0 replies
15h33m

I had the same experience as what you described (with a lot of experimentation with alpha and beta, as well as uploading different audio clips).

godelski
6 replies
22h14m

Why name it Style<anything> if it isn't a StyleGAN? Looks like the first one wasn't either. Interesting to see moves away from flows, especially when none of the flows were modern.

Also, is no one clicking on the audio links? There are some... questionable ones... and I'm pretty sure lots of mistakes.

lhl
3 replies
21h46m

It's not called a GAN TTS, right? StyleGAN is called what it is because of a "style-based" approach, and StyleTTS/2 seems to be doing the same (applying style transfer) through a different method (and disentangling style from the rest of the voice synthesis).

(Actually, I looked at the original StyleTTS paper and it even partially uses AdaIN in the decoder, which is the same way that StyleGAN injected style information? Still, I think this is beside the point for the naming.)

godelski
2 replies
19h53m

Yeah, no, I get this, but the naming convention has become so prolific that anyone working in the generative space hears "Style<thing>" and thinks "GAN". (I work in generative vision btw.)

My point is not that it is technically right; it is that the name is strongly associated with the concept now, such that if you use a style-based network and don't name it StyleX, it's odd and might look like you're trying to claim you've done more. Not that there aren't plenty of GANs that are using Karras's code and called something else.

AdaIN

Yes, StyleGAN (version 1) uses AdaIN but StyleGAN2 (and beyond) doesn't. AdaIN stands for Adaptive Instance Normalization. While they use it in that network, to be clear, they did not invent AdaIN and the technique isn't specific to style; it's a normalization technique, one that StyleGAN2 modifies because the standard one creates strong and localized spikes in the statistics which result in image artifacts.

lhl
1 replies
19h7m

So what I'm hearing is... no one should use "style" in its name anymore to describe style transfer because it's too closely associated with a set of models in a sub-field that uses a different concept to apply style and used "style" in its name, unless it also uses that unrelated concept in its implementation? Is that the gist of it? Because that sounds a bit mental.

(I'm half kidding, I get what you mean, but also, think about it. The alternative is worse.)

godelski
0 replies
16h44m

I'm half kidding, I get what you mean

I mean, yeah, I'm not saying that they shouldn't be able to use the name. There's no control over "StyleX", but it certainly is a poor choice that can lead to confusion. That's all I'm getting at. 100% this is an opinion (it would be insane to believe it's anything else).

I don't think it is just a "sub-field" as you mention, and it definitely isn't like StyleGAN isn't known by nearly every person that learns ML (I have seen very few courses that do not mention it, but those tend to be ones that don't discuss generation at all). StyleGAN is one of the most well-known models that exist, up there with GPT, YOLO, and ViT. Realistically we use these names for a style of model now rather than the actual original models themselves (or somewhat interchangeably).

The original StyleTTS's abstract has the line

Here, we propose StyleTTS, a style-based generative model for parallel TTS

And I certainly would not blame anyone for thinking "Oh, they're using a StyleGAN". That's all I'm saying. Their style encoder looks nothing like StyleGAN's. It looks a bit closer to the synthesis network, but that's just because they're using Leaky ReLUs and AdaIN, and like we said before, that's not really a StyleGAN-specific thing. There are other parts we could say look similar, but they are pretty generic sections that I wouldn't say pertain uniquely to StyleGAN architectures (or the StyleDiffusion ones that do make this callback).

In other words, it's like naming something iX. Sure, Apple doesn't have complete control over a leading letter, but I also understand Apple's claim that such a naming pattern can confuse people. Certainly a name collision. Hell, I'll say the authors of this paper knew what they were doing: https://arxiv.org/abs/2212.01452

I just think they could come up with a better name with less chance of collision. It's not like StyleTTS is a particularly creative name or even that apt a description. Names are important because they mean things. You may think it is not a poor choice of name, and that's okay too. But we work in different fields, and I'd argue that research papers are aimed at other researchers, where I would be surprised if anyone working in generation (image, voice, language, data, whatever) is not well aware of the style-based networks. Because I can use that sentence and it makes sense to most ML people.

gwern
1 replies
21h47m

Looks like the first one wasn't either.

The first one says it uses AdaIN layers to help control style? https://arxiv.org/pdf/2205.15439.pdf#page=2 Seems as justifiable as the original StyleGAN calling itself StyleX...

godelski
0 replies
19h52m

See my other comment. StyleGAN isn't about AdaIN. StyleGAN2 even modified it.

gjm11
6 replies
21h43m

HN title at present is "StyleTTS2 – open-source Eleven Labs quality Text To Speech". Actual title at the far end doesn't name any particular other product; arXiv paper linked from there doesn't mention Eleven Labs either. I thought this sort of editorializing was frowned on.

stevenhuang
3 replies
21h9m

Eleven Labs is the gold standard for voice synthesis. There is nothing better out there.

So it is extremely notable for an open source system to be able to approach this level of quality, which is why I'd imagine most would appreciate the comparison. I know it caught my attention.

lucubratory
1 replies
19h47m

OpenAI's TTS is better than Eleven Labs, but they don't let you train it to have a particular voice out of fear of the consequences.

huac
0 replies
19h35m

I concur that, for the use cases that OpenAI's voices cover, it is significantly better than Eleven.

yreg
0 replies
11h58m

But is this even approaching Eleven? Doesn't seem like it from the other comments here.

modeless
0 replies
18h54m

It is editorializing and it is an exaggeration. However, I've been using StyleTTS2 myself, and IMO it is the best open-source TTS by far and definitely deserves a spot at the top of HN for a while.

GaggiX
0 replies
20h57m

Yes, it's against the guidelines. In fact, when I read the title, I didn't think it was a new research paper but a random GitHub project.

sandslides
5 replies
23h30m

Just tried the colab notebooks. Seems to be very good quality. It also supports voice cloning.

fullstackchris
3 replies
23h0m

Great stuff, took a look through the README but... what are the minimum hardware requirements to run this? Is this gonna blow up my CPU / hard drive?

sandslides
2 replies
22h54m

Not sure. The only inference demos are colab notebooks. The models are approx 700 MB each, so I imagine it will run on a modest GPU.

bbbruno222
1 replies
22h28m

Would it run in a cheap non-GPU server?

dmw_ng
0 replies
20h57m

Seems to run at about "2x realtime" on a 2015 4-core i7-6700HQ laptop, that is, 5 seconds to generate 10 seconds of output. I can imagine that being 4x or greater on a real machine.

thot_experiment
0 replies
19h38m

I skimmed the GitHub but didn't see any info on this: how long does it take to fine-tune to a particular voice?

zsoltkacsandi
3 replies
18h58m

Is it possible to somehow optimize the model to run on a Raspberry Pi with 4 GB of RAM?

zsoltkacsandi
2 replies
17h20m

I was able to get it to work with libjemalloc.
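
Roughly like this, in case anyone else wants to try it (a sketch, not a tested recipe: the jemalloc path is the usual one on 64-bit Raspberry Pi OS but varies by distro, and inference.py stands in for whatever script actually drives StyleTTS2 in your setup):

    import os
    import subprocess

    # Preload jemalloc so the Python process uses it as its allocator,
    # which is what helped on the 4 GB Pi. Adjust the .so path for your distro.
    env = dict(os.environ)
    env["LD_PRELOAD"] = "/usr/lib/aarch64-linux-gnu/libjemalloc.so.2"
    subprocess.run(["python3", "inference.py"], env=env, check=True)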

GaggiX
1 replies
16h36m

How fast is it on your Raspberry Pi?

zsoltkacsandi
0 replies
12h23m

Super slow. On my Mac Mini the inference ran in seconds; on the Raspberry Pi, minutes.

wg0
2 replies
19h32m

The quality is really, really INSANE and pretty much unimaginable in the early 2000s.

Could have interesting prospects for games, where you have an LLM playing a character and TTS like this giving those NPCs a voice.

abraae
1 replies
19h24m

This is a big thing for one area I'm interested in - golf simulation.

Currently, playing in a golf simulator has a bit of a post-apocalyptic vibe. The birds are cheeping, the grass is rustling, the gameplay is realistic, but there's not a human to be seen. It's just so different from the smack talk of a real round, or the crowd noise at a big tournament.

It's begging for some LLM-fuelled banter to be added.

billylo
0 replies
18h36m

Or the occasional "Fore!!"s. :-)

victorbjorklund
2 replies
20h55m

This only works for English voices right?

e12e
1 replies
19h35m

No? From the readme:

In Utils folder, there are three pre-trained models:

    ASR folder: It contains the pre-trained text aligner, which was pre-trained on English (LibriTTS), Japanese (JVS), and Chinese (AiShell) corpus. It works well for most other languages without fine-tuning, but you can always train your own text aligner with the code here: yl4579/AuxiliaryASR.

    JDC folder: It contains the pre-trained pitch extractor, which was pre-trained on English (LibriTTS) corpus only. However, it works well for other languages too because F0 is independent of language. If you want to train on singing corpus, it is recommended to train a new pitch extractor with the code here: yl4579/PitchExtractor.

    PLBERT folder: It contains the pre-trained PL-BERT model, which was pre-trained on English (Wikipedia) corpus only. It probably does not work very well on other languages, so you will need to train a different PL-BERT for different languages using the repo here: yl4579/PL-BERT. You can also replace this module with other phoneme BERT models like XPhoneBERT which is pre-trained on more than 100 languages.

modeless
0 replies
18h52m

Those are just parts of the system and don't make a complete TTS. In theory you could train a complete StyleTTS2 for other languages but currently the pretrained models are English only.

deknos
2 replies
9h12m

Is this really open source and/or free software? Like the code, dataset(s), and models?

I am quite tired of seeing "open-source" announcements where half or more of it is not actually free.

general psa: please be honest in your announcements :|

acheong08
1 replies
4h17m

MIT licensed. Models, code, and everything is available right there when you click the link.

Maybe actually check it out before complaining.

mx20
0 replies
2h43m

But you are wrong: the trained models are hosted separately on Google Drive and come with the following text, which appears to be an additional license agreement that also covers use of the software and any trained model.

License Part 2 Text: "Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices pubilc, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices."

api
2 replies
22h12m

It should be pretty easy to make training data for TTS. The Whisper STT models are open so just chop up a ton of audio and use Whisper to annotate it, then train the other direction to produce audio from text. So you’re basically inverting Whisper.
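
The annotation half is roughly this (a sketch using the open-source openai-whisper package; the manifest format here is made up for illustration, not something StyleTTS2 or any particular trainer expects):

    import json
    import whisper  # pip install openai-whisper

    model = whisper.load_model("medium.en")

    def annotate(paths, manifest="tts_manifest.jsonl"):
        # Write one (audio, start, end, text) record per Whisper segment;
        # a TTS trainer would then slice the audio on these timestamps.
        with open(manifest, "w") as f:
            for path in paths:
                result = model.transcribe(path)
                for seg in result["segments"]:
                    f.write(json.dumps({
                        "audio": path,
                        "start": round(seg["start"], 2),
                        "end": round(seg["end"], 2),
                        "text": seg["text"].strip(),
                    }) + "\n")

    annotate(["clip_0001.wav", "clip_0002.wav"])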

nmfisher
0 replies
17h6m

I think you're talking about just using Whisper to annotate audio for a TTS pipeline, but someone from Collabora actually created a TTS model directly from Whisper embeddings: https://github.com/collabora/WhisperSpeech

eginhard
0 replies
19h47m

STT training data includes all kinds of "noisy" speech so that the model learns to recognise speech in any conditions. TTS training data needs to be as clean as possible so that you don't introduce artefacts in the output and this high-quality data is much harder to get. A simple inversion is not really feasible or at least requires filtering out much of the data.
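
If you did go that route, one crude first-pass filter is to lean on Whisper's own per-segment confidence fields (a sketch only; the thresholds are illustrative guesses, and a real pipeline would also check SNR, clipping, overlapping speakers, and so on):

    import whisper  # pip install openai-whisper

    def is_clean(seg):
        # Keep only segments Whisper itself was confident about.
        return (
            seg["avg_logprob"] > -0.5           # decoder confidence
            and seg["no_speech_prob"] < 0.2     # likely actual speech
            and seg["compression_ratio"] < 2.4  # not degenerate/repetitive text
        )

    model = whisper.load_model("medium.en")
    result = model.transcribe("clip_0001.wav")
    clean = [s for s in result["segments"] if is_clean(s)]
    print(f"kept {len(clean)} of {len(result['segments'])} segments")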

visarga
1 replies
17h42m

Yes, please integrate it with Mistral and Whisper. This has got to get into the LLM frontends.

modeless
0 replies
14h56m

Done: https://apps.microsoft.com/detail/9NC624PBFGB7

It's mostly just a demo for now and a little bit janky but it's fun to chat with and you can see the promise for 100% local voice AI in the future.

svapnil
1 replies
20h39m

How fast is inference with this model?

For reference, I'm using 11Labs to synthesize short messages - maybe a sentence or so - using voice cloning, and I'm getting around 400-500 ms response times.

Is there any OS solution that gets me to around the same inference time?

wczekalski
0 replies
20h35m

It depends on hardware but IIRC on V100s it took 0.01-0.03s for 1s of audio.

mazoza
1 replies
16h55m

meh this is not that good. Sounds quite boring.

ChildOfChaos
0 replies
7h13m

Agreed, this isn't Eleven labs quality at all.

exizt88
1 replies
17h35m

The weights aren’t MIT-licensed, so this is not usable in commercial applications, right?

acheong08
0 replies
4h12m

It is usable in commercial applications, provided you disclose the use of AI. This applies only to the pre-trained models; you can train your own from scratch without these restrictions.

You can also fine-tune it on your own voice without being required to disclose the use of AI.

wanderingmind
0 replies
15h38m

As a tangent away from LLMs: is there an integration available to use this as a TTS engine on Android? The TTS voice that I have now (RHVoice) for OSMAnd is really driving me crazy and almost makes me want to go back to Google Maps.

wahnfrieden
0 replies
19h27m

Is there a way to port this to iOS? Apple doesn't provide an API for their version of this.

tomcam
0 replies
19h59m

Very impressive. It would take me a long time to even guess that some of these are text to speech.

swyx
0 replies
19h39m

Silicon Valley is very leaky; Eleven Labs is widely rumored to have raised a huge round recently. Great timing, because with OpenAI's TTS and now this, the options in the market have just expanded greatly.

readyplayernull
0 replies
19h36m

Someone please create a TTS with markdown-style markup for emotions/intonations.

lxe
0 replies
13h47m

Wow this thing is wicked fast!

lfmunoz4
0 replies
11h27m

Been looking for a speech-to-text that can work in real time and run locally. Anyone know the best options available?

jasonjmcghee
0 replies
22h2m

I've been playing with XTTSv2 on my 3080 Ti, and generation is slightly faster than the length of the final audio. It's also good quality, but these samples sound better.

Excited to try it out!

ddmma
0 replies
19h24m

Well done, been waiting for a moment like this. Will give it a try!

causality0
0 replies
18h44m

What are the chances this gets packaged into something a little more streamlined to use? I have a lot of ebooks I'd love to generate audio versions of.

carbocation
0 replies
18h36m

Having now tried it (the linked repo links to pre-built colab notebooks):

1) It does a fantastic job of text-to-speech.

2) I have had no success in getting any meaningful zero-shot voice cloning working. It technically runs and produces a voice, but it sounds nothing like the target voice. (This includes trying their microphone-based self-voice-cloning option.)

Presumably fine-tuning is needed - but I am curious if anyone had better luck with the zero-shot approach.
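
For anyone else experimenting: my (possibly wrong) mental model of what alpha and beta do in the demo notebooks is an interpolation between the style predicted from the text and the style extracted from the reference clip, roughly as sketched below. The 128/128 split into acoustic and prosodic halves is an assumption based on the paper's description, not code from the repo; lower alpha and beta should keep more of the reference.

    import numpy as np

    def blend_style(pred_style: np.ndarray, ref_style: np.ndarray,
                    alpha: float = 0.3, beta: float = 0.7) -> np.ndarray:
        # Hypothetical sketch (not the repo's API): alpha blends the acoustic
        # half of the 256-dim style vector, beta the prosodic half. Lower
        # values weight the reference clip more, which is what you want
        # when trying to clone a voice.
        acoustic = alpha * pred_style[:128] + (1 - alpha) * ref_style[:128]
        prosodic = beta * pred_style[128:] + (1 - beta) * ref_style[128:]
        return np.concatenate([acoustic, prosodic])

    # e.g. blend_style(pred, ref, alpha=0.1, beta=0.1) to stay close to the reference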

carbocation
0 replies
19h54m

Curious if we'll see a Civitai-style LoRA[1] marketplace for text-to-speech models.

1 = https://github.com/microsoft/LoRA

acheong08
0 replies
4h19m

I am an introvert: I rarely socialize, listen to podcasts at 2x speed, and mostly use subtitles rather than listening to the audio for movies, so I probably have a below-average ability to differentiate humans from robots.

I asked someone to play the recordings for me to differentiate. I could not tell which was human when comparing StyleTTS2 to the ground truth (the others were obvious).

Havoc
0 replies
16h42m

Those sound incredibly good.

Though I would definitely like to clone a pleasant voice before using it. These sound good, but they're not my cup of tea.

GaggiX
0 replies
16h24m

They really should have uploaded the models to Hugging Face rather than Google Drive.

Evidlo
0 replies
20h32m

What's a ballpark estimate for inference time on a modern CPU?