Stable Diffusion 3: Research Paper

WiSaGaN
26 replies
6h16m

More and more companies that were once devoted to being 'open', or actually were open at one point, are now becoming increasingly closed. I appreciate that Stability AI releases these research papers.

loudmax
16 replies
5h32m

It's hard to build a business on "open". I'm not sure what Stability AI's long-term direction will be, but I hope they figure out a way to become profitable while creating these free models.

sharmajai
12 replies
4h3m

Maybe not everything should be about business.

smith7018
5 replies
3h48m

Agreed, but this isn't the same as an open-source library; it costs A LOT of money to constantly train these models. That money has to come from somewhere, unfortunately.

TehCorwiz
4 replies
3h27m

Yeah. The amount of compute required is pretty high. I wonder: is there enough distributed compute available to bootstrap a truly open model through a system like SETI@home or Folding@home?

Filligree
3 replies
3h23m

The compute exists, but we'd need some conceptual breakthroughs to make DNN training over high-latency internet links make sense.

altruios
1 replies
2h0m

Distributing the training data also opens up attack vectors. Poisoning or biasing of the dataset distributed to each computer needs to be guarded against... but I don't think that's actually possible in a distributed model (in principle?). If the compute is happening off-server, then trust is required (which is not {efficiently} enforceable?).

TehCorwiz
0 replies
53m

Trust is kinda a solved problem in distributed computing. The different "@Home" projects and Bitcoin handle this by requiring multiple validations of a block of work for just this reason.
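
Roughly, the validation idea in code - a purely illustrative Python sketch, not any project's actual protocol; the function name and quorum size are made up:

    from collections import Counter

    def accept_work_unit(results, quorum=3):
        # Accept a work unit only if at least `quorum` independent clients
        # returned the same answer; anything else gets redistributed and redone.
        value, count = Counter(results).most_common(1)[0]
        return value if count >= quorum else None

    # accept_work_unit(["abc", "abc", "abc", "xyz"]) -> "abc"
    # accept_work_unit(["abc", "xyz"]) -> None (needs more replicas)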

pksebben
0 replies
3h4m

Forward-Forward looked promising, but then Hinton got the AI-Doomer heebie-jeebies and bailed. Perhaps someone will pick up the concept and run with it - I'd love to myself, but I don't have the skillz to build stuff at that depth, yet.

TehCorwiz
1 replies
3h52m

I agree, but Y-Combinator literally only exists to squeeze the most bizness out of young smart people. That's why you're not seeing so much agreement.

phkahler
0 replies
1h55m

> but Y-Combinator literally only exists to squeeze the most bizness out of young smart people.

YC started out with the intent to give young smart people a shot at starting a business. IMHO it has shifted significantly over the years toward more of what you say. We see ads now seeking a "founding engineer" for YC startups, but it used to be that the founders were the engineers.

mvkel
0 replies
2h46m

The choice facing many companies that insist on remaining "open" is:

Do you want to:

1. be right, or

2. stay in business?

This is one of the reasons why OpenAI pivoted to being closed. Not because of greedy value extractors, but because it was the only way to survive.

mikkom
0 replies
2h14m

That was basically why OpenAI was founded.

Too bad they decided to get greedy :-(

ben_w
0 replies
3h0m

Great, but aren't they simultaneously losing money and getting sued?

baq
0 replies
3h52m

Maybe. Paychecks help with not being hungry, though.

I'd be happy if my government or the EU or whatever offered cash grants for open research and open weights in the AI space.

The problem is, everyone wants to be a billionaire over there and it’s getting crowded.

michaelt
1 replies
1h26m

Maybe, but in image generation it's also hard to be closed.

The big providers are all so terrified they'll produce a deepfake image of Obama getting arrested or something that the models are so locked down they only seem capable of producing stock photos.

sandworm101
0 replies
58m

> The big providers are all so terrified they'll produce a deepfake image of Obama getting arrested or something

I think the content they are worried about is far darker than an attempt to embarrass a former president.

pleasantpeasant
0 replies
1h6m

The internet wouldn't have become as big as it is if it weren't for its open-source model.

The internet has since been taken over by capitalists, and they have ruined it, in my opinion.

londons_explore
7 replies
5h30m

They used to let you download the model weights to run on your own machine... but Stable Diffusion 3 is just in 'limited preview' with no public download links.

One has to wonder, why the delay?

Sharlin
4 replies
5h4m

How is a closed beta anything out of the ordinary? They know they would only get tons of shit flung at them if they publicly released something beta-quality, even if clearly labeled as such. SD users can be a VERY entitled bunch.

causal
1 replies
2h44m

I've noticed a strange attitude of entitlement that seems to scale with how open a company is - Mistral and Stability AI are on very sensitive ground with the open-source community despite being the most open.

idle_zealot
0 replies
35m

If you try to court a community, then it will expect more of you. It's the same as how, if you claimed to be an environmentalist company, you would receive more scrutiny from environmentalists checking your claims.

Filligree
1 replies
4h3m

Moreover, people would start training on the beta model, splitting the ecosystem if it doesn't die entirely. There's nothing good in that timeline.

Sharlin
0 replies
3h16m

Uff, that's a good point.

nuz
0 replies
5h15m

Both SD1.4 and SDXL were in limited preview for a few months before a public release. This has been their normal course of business for about 2 years now (since founding). They just do this to improve the weights via a beta test with less judgemental users before the official release.

cthalupa
0 replies
5h25m

That's nothing new with Stability. Even 1.5 was "released early" by RunwayML because they felt Stability was taking too long to release the weights instead of just providing them in DreamStudio.

Stability will release them in the coming weeks.

caycep
0 replies
45m

Are they still the commercial affiliate of the CompVis group at Ludwig Maximilian University?

whywhywhywhy
8 replies
8h2m

It's impressive that it spells words correctly and lays them out, but the issue I have is that the text always has this distinctly overly fried look to it. The color of the text is always ramped up to a single value, which, when placed into a high-fidelity image, gives the impression of text just slapped on top in Photoshop afterwards in quite an amateurish fashion, rather than text properly integrated into the image.

bsenftner
2 replies
5h57m

I'm expecting the Stable Diffusion developer community to recognize, at some point, the value of Layered Diffusion, the method of generating elements with transparent backgrounds, and to transition to outputs that are layered images one may access and tweak independently. The addition of that would make the hands-on media producers of the world say "okay, now we're talking, this is finally directly digestible into our existing production pipelines."

cthalupa
1 replies
5h23m

There are already ComfyUI nodes for Layered Diffusion: https://github.com/huchenlei/ComfyUI-layerdiffuse

Of the people I know in the CG industries using SD in any sort of pipeline, they're all using Comfy because a node-based workflow is what they're used to from things like Substance Designer, Houdini, Nuke, Blender, etc.

bsenftner
0 replies
5h3m

That's my impression as well. I used to work in VFX; my entire career is pretty much 3D something or other, over and over.

viraptor
1 replies
6h44m

It's just the presented examples. See the first preview for examples of properly integrated text: https://stability.ai/news/stable-diffusion-3

Especially the side of the bus.

MyFirstSass
0 replies
4h53m

The side of the bus still looks weird, though. It's like some lower-resolution layer was transformed onto the side, and it's also still too bright.

Still impressive that we've come this far, but we're now in the uncanny valley, which is a badge of honour tbh.

imiric
1 replies
7h51m

But the sample images they show here do a good job of blending text with the rest of the image, using the correct art style, composition, shading and perspective. It seems like an improvement, no?

blehn
0 replies
3h54m

The blending looks better, but the LED sign on the bus for example looks almost like handwritten lettering... the letters are all different heights and widths. Not even close to realistic. There's a lot of nuance that goes into getting these things right. It seems like it'll be stuck in an uncanny valley for a long time.

GaggiX
0 replies
5h13m

It's very likely an artifact of CFG (classifier-free guidance); hopefully someday we will be able to ditch this kinda dubious trick.

This is also the reason why the generated images have this characteristic high contrast and saturation. Better models usually need to rely less on CFG to generate coherent images because they fit the training distribution better.
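
For reference, the CFG step at sampling time is just an extrapolation between the conditional and unconditional predictions - a minimal Python sketch, where the names and default guidance scale are placeholders rather than SD3's actual code:

    def cfg_prediction(denoiser, x_t, t, cond, uncond, guidance_scale=7.0):
        # Run the denoiser twice: once with the text condition, once without.
        eps_cond = denoiser(x_t, t, cond)
        eps_uncond = denoiser(x_t, t, uncond)
        # Extrapolate away from the unconditional prediction; larger scales
        # improve prompt adherence but also push up contrast and saturation.
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)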

finnjohnsen2
6 replies
7h43m

Question is, will SD3 be downloadable? I downloaded the early SD and run it locally, and it is really great.

Or did we lose Stable Diffusion to SaaS too? Like we did with many of the LLMs, which started off so promising as far as self-hosting goes.

sen
3 replies
7h40m

Sounds like it’ll be downloadable. FTA:

> In early, unoptimized inference tests on consumer hardware our largest SD3 model with 8B parameters fits into the 24GB VRAM of a RTX 4090 and takes 34 seconds to generate an image of resolution 1024x1024 when using 50 sampling steps. Additionally, there will be multiple variations of Stable Diffusion 3 during the initial release, ranging from 800m to 8B parameter models to further eliminate hardware barriers.

nuz
1 replies
5h4m

The 800m model is super exciting

jncfhnb
0 replies
1h59m

It will probably suck. These models aren’t quite good enough for most tasks (other than toy fun exploration). They’re close in the sense that you can get there with a lot of work and dice rolling. But I would be pessimistic about a smaller model actually getting you where you want.

finnjohnsen2
0 replies
7h37m

Thanks for pointing that out. Super promising.

emadm
0 replies
2h19m

Yeah, they will all be downloadable weights. 800m, 2b and 8b currently planned.

Mashimo
0 replies
5h31m

It looks like it. I really hope they do. Running SDXL right now is proper fun. I don't even use it for anything specific, just to amuse myself at times :D

TheAceOfHearts
4 replies
7h3m

It's very exciting to see that image generators are finally figuring out spelling. When DALL-E 3 (?) came out, they hyped up its spelling capabilities, but when I tried it with Bing it was incredibly inconsistent.

I'd love to read a less technical writeup explaining the challenges faced and why it took so long to figure out spelling. Scrolling through the paper is a bit overwhelming and it goes beyond my current understanding of the topic.

Does anyone know if it would be possible to eventually take older generated images with garbled up text + their prompt and have SD3 clean it up or fix the text issues?

vergessenmir
0 replies
6h58m

I would imagine that with an img2img workflow it would be, the same way you can reconstruct a badly rendered face by doing a second pass on the affected region.
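
Something like diffusers' inpainting pipeline with a mask over just the text region, for example - a rough sketch where the checkpoint, file names and prompt are placeholders:

    from diffusers import StableDiffusionInpaintPipeline
    from PIL import Image

    # Any SD inpainting checkpoint works the same way; this one is a placeholder.
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting"
    ).to("cuda")

    init = Image.open("old_generation.png")    # image with garbled text
    mask = Image.open("text_mask.png")         # white where the bad text is
    fixed = pipe(
        prompt='a storefront sign that reads "OPEN"',
        image=init,
        mask_image=mask,
    ).images[0]
    fixed.save("fixed.png")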

emadm
0 replies
2h19m

Yes, this is possible, we have ComfyUI workflows for this.

declaredapple
0 replies
5h14m

The best way to do it right now is with ControlNets.

I'm not sure about redoing the text of old images - you could try img2img, but coherence is an issue; more ControlNets might help.

edshiro
2 replies
3h19m

This is really exciting to see. I applaud Stability AI's commitment to open source and hope they can operate for as long as possible.

There was one thing I was curious about... I skimmed through the executive summary of the paper but couldn't find it. Does Stable Diffusion 3 still use CLIP from OpenAI for tokenization and text embeddings? I would naively assume that they would try to improve this part of the model's architecture to get better adherence to text and image prompts.

ollin
0 replies
2h36m

They use three text encoders to encode the caption:

1. CLIP-G/14 (OpenCLIP)

2. CLIP-L/14 (OpenAI)

3. T5-v1.1-XXL (Google)

They randomly disable encoders during training, so that when generating images SD3 can use any subset of the 3 encoders. They find that using T5 XXL is important only when generating images from prompts with "either highly detailed descriptions of a scene or larger amounts of written text".
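
Conceptually, that training-time dropout is something like this - an illustrative Python sketch, not the actual training code; the drop probability and how the three outputs get combined downstream are glossed over:

    import random
    import torch

    def encode_caption(caption, encoders, drop_prob=0.5):
        # encoders = (clip_g, clip_l, t5_xxl); each one is independently
        # zeroed out with some probability so the model learns to generate
        # from any subset of the three at inference time.
        outputs = []
        for encoder in encoders:
            emb = encoder(caption)
            if random.random() < drop_prob:
                emb = torch.zeros_like(emb)
            outputs.append(emb)
        return outputs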

MrCheeze
0 replies
3h6m

One of the diagrams says they're using CLIP-G/14 and CLIP-L/14, which are the names of two OpenCLIP models - meaning they're not using OpenAI's CLIP.

vessenes
0 replies
2h42m

This looks great, very exciting. The paper is not a lot more detailed than the blog. The main thing about the paper is that they have an architecture that can include more expressive text encoders (T5-XXL here), they show this helps with complex scenes, and it seems clear they haven't maxed out this stack in terms of training. So, expect SD 3.1 to be better than this, and expect 4 to be able to work with video by adding even more front-end encoding. Exciting!

nojvek
0 replies
49m

Heh! In contrast to Stability AI, OpenAI is the least open AI lab. Even DeepMind publishes more papers.

I wonder if anyone at OpenAI openly says, "We're in it for the money!"

The recent letter by SamA regarding Elon's lawsuit had as much truth as Putin saying they are invading Ukraine for de-nazification.