More and more companies that were once devoted to being 'open' are now becoming increasingly closed. I appreciate that Stability AI releases these research papers.
It's impressive that it spells words correctly and lays them out, but the issue I have is that the text always has this distinctively over-fried look to it. The color of the text is always ramped up to a single value, which, when placed into a high-fidelity image, gives the impression of text slapped on top in Photoshop afterwards in quite an amateurish fashion, rather than text properly integrated into the image.
I'm expecting at some point the stable diffusion community of developers to recognize the value of Layered Diffusion, the method of generating elements with transparent backgrounds, and transition to outputs that are layered images one can access and tweak independently. The addition of that would make the hands-on media producers of the world say "okay, now we're talking, finally directly ingestible into our existing production pipelines."
There are already ComfyUI nodes for Layered Diffusion. https://github.com/huchenlei/ComfyUI-layerdiffuse
Everyone I know in the CG industries using SD in any sort of pipeline is using Comfy, because a node-based workflow is what they're used to from tools like Substance Designer, Houdini, Nuke, Blender, etc.
That's my impression as well. I used to work in VFX, my entire career is pretty much 3D something or other, over and over.
It's just the presented examples. See the first preview for examples of properly integrated text https://stability.ai/news/stable-diffusion-3
Especially the side of the bus.
The side of the bus still looks weird, though. Like some lower-resolution layer was transformed onto the side; it's also still too bright.
Still, it's impressive we've come this far, but we're now in the uncanny valley, which is a badge of honour tbh.
But the sample images they show here showcase a good job at blending text with the rest of the image, using the correct art style, composition, shading and perspective. It seems like an improvement, no?
The blending looks better, but the LED sign on the bus for example looks almost like handwritten lettering... the letters are all different heights and widths. Not even close to realistic. There's a lot of nuance that goes into getting these things right. It seems like it'll be stuck in an uncanny valley for a long time.
It's very likely an artifact of CFG (classifier-free guidance); hopefully some day we'll be able to ditch this somewhat dubious trick.
This is also the reason why the generated images have this characteristic high contrast and saturation. Better models usually need to rely less on CFG to generate coherent images because they fit the training distribution better.
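For anyone curious, here's a rough sketch of what CFG does at each sampling step (a minimal PyTorch illustration; the variable names and guidance scale are placeholders, not SD3's actual code):

    import torch

    def cfg_noise_prediction(model, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
        # Run the denoiser twice: once conditioned on the prompt embedding,
        # once on an "empty" unconditional embedding.
        noise_cond = model(x_t, t, cond_emb)
        noise_uncond = model(x_t, t, uncond_emb)
        # Extrapolate away from the unconditional prediction. Higher guidance
        # scales improve prompt adherence but push outputs toward the
        # over-contrasted, over-saturated look mentioned above.
        return noise_uncond + guidance_scale * (noise_cond - noise_uncond)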
Question is, will SD3 be downloadable? I downloaded and ran the early SD locally and it is really great.
Or did we lose Stable Diffusion to SaaS too, like we did with many of the LLMs that started off so promising as far as self-hosting goes?
Sounds like it’ll be downloadable. FTA:
In early, unoptimized inference tests on consumer hardware our largest SD3 model with 8B parameters fits into the 24GB VRAM of a RTX 4090 and takes 34 seconds to generate an image of resolution 1024x1024 when using 50 sampling steps. Additionally, there will be multiple variations of Stable Diffusion 3 during the initial release, ranging from 800m to 8B parameter models to further eliminate hardware barriers.
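If the weights do land on the Hugging Face Hub like previous releases, running it locally would presumably look something like this diffusers sketch (the repo id is a placeholder and the pipeline dispatch is an assumption; nothing here is confirmed):

    import torch
    from diffusers import DiffusionPipeline

    # Hypothetical repo id; whatever Stability actually publishes may differ.
    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-3",   # placeholder name
        torch_dtype=torch.float16,          # fp16 to help the 8B model fit in 24GB VRAM
    )
    pipe.to("cuda")

    image = pipe(
        "a red bus with 'Stable Diffusion 3' written on its side",
        num_inference_steps=50,   # matches the 50-step figure quoted above
        height=1024,
        width=1024,
    ).images[0]
    image.save("sd3_test.png")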
The 800m model is super exciting
It will probably suck. These models aren’t quite good enough for most tasks (other than toy fun exploration). They’re close in the sense that you can get there with a lot of work and dice rolling. But I would be pessimistic about a smaller model actually getting you where you want.
Thanks for pointing that out. Super promising.
Yeah, they will all be downloadable weights. 800m, 2b and 8b currently planned.
It looks like it. I really hope they do. Running SDXL right now is proper fun. I don't even use it for anything specific, just to amuse myself at times :D
It's very exciting to see that image generators are finally figuring out spelling. When DALL-E 3 (?) came out they hyped up spelling capabilities but when I tried it with Bing it was incredibly inconsistent.
I'd love to read a less technical writeup explaining the challenges faced and why it took so long to figure out spelling. Scrolling through the paper is a bit overwhelming and it goes beyond my current understanding of the topic.
Does anyone know if it would be possible to eventually take older generated images with garbled up text + their prompt and have SD3 clean it up or fix the text issues?
I would imagine with an img2img workflow it would be, the same way you can reconstruct a badly rendered face by doing a second pass on the affected region.
It's a surprisingly difficult task with quite a bit of research history.
We looked at different solutions extensively (https://medium.com/towards-data-science/editing-text-in-imag...) and ended up building a tool to solve the problem: https://www.producthunt.com/posts/textify-2
Eventually models will get to the point where they can do this well natively but for now the best we can do is a post-processing step.
Yes, this is possible, we have ComfyUI workflows for this.
The best way to do it right now is controlnets.
I'm not sure about re-doing the text of old images; you could try img2img, but coherence is an issue. More controlnets might help.
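For what it's worth, a minimal img2img second pass with diffusers might look roughly like this (the checkpoint, prompt and strength are illustrative; you'd swap in SD3 once its weights are out):

    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # any checkpoint; SD3 once released
        torch_dtype=torch.float16,
    ).to("cuda")

    old_image = Image.open("garbled_text.png").convert("RGB")

    # Re-run the original prompt over the old image. Lower strength preserves
    # more of the original composition; higher strength gives the model more
    # freedom to redraw (and hopefully fix) the lettering, at the cost of coherence.
    fixed = pipe(
        prompt="a storefront sign that reads 'OPEN 24 HOURS'",
        image=old_image,
        strength=0.5,
        guidance_scale=7.0,
    ).images[0]
    fixed.save("fixed_text.png")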
This is really exciting to see. I applaud Stability AI's commitment to open source and hope they can operate for as long as possible.
There was one thing I was curious about... I skimmed through the executive summary of the paper but couldn't find it. Does Stable Diffusion 3 still use CLIP from OpenAI for tokenization and text embeddings? I would naively assume that they would try to improve on this part of the model's architecture to improve adherence to text and image prompts.
They use three text encoders to encode the caption:
1. CLIP-G/14 (OpenCLIP)
2. CLIP-L/14 (OpenAI)
3. T5-v1.1-XXL (Google)
They randomly disable encoders during training, so that when generating images SD3 can use any subset of the 3 encoders. They find that using T5 XXL is important only when generating images from prompts with "either highly detailed descriptions of a scene or larger amounts of written text".
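A toy sketch of that random encoder drop-out during training (heavily simplified; the drop probability and the way embeddings are combined here are illustrative, not the paper's exact scheme):

    import random
    import torch

    def encode_caption(caption, clip_g, clip_l, t5, p_drop=0.5):
        # Encode the caption with all three text encoders, randomly zeroing
        # each one out during training. Because the model learns to cope with
        # missing encoders, any subset can be used at inference time, e.g.
        # skipping the large T5 when the prompt has no long text to render.
        embeddings = []
        for encoder in (clip_g, clip_l, t5):
            emb = encoder(caption)
            if random.random() < p_drop:
                emb = torch.zeros_like(emb)  # drop this encoder for this sample
            embeddings.append(emb)
        return torch.cat(embeddings, dim=-1)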
One of the diagrams says they're using CLIP-G/14 and CLIP-L/14, which are the names of two OpenCLIP models - meaning they're not using OpenAI's CLIP.
This looks great, very exciting. The paper is not a lot more detailed than the blog. The main thing about the paper is that they have an architecture that can include more expressive text encoders (T5-XXL here), they show this helps with complex scenes, and it seems clear they haven't maxed out this stack in terms of training. So, expect SD3.1 to be better than this, and expect SD4 to be able to work with video through adding even more front-end encoding. Exciting!
Ha! In contrast to Stability AI, OpenAI is the least open AI lab. Even DeepMind publishes more papers.
I wonder if anyone at OpenAI openly says it: "We're in it for the money!"
The recent letter by SamA regarding Elon's trial had as much truth as Putin saying they are invading Ukraine for de-nazification.
It's hard to build a business on "open". I'm not sure what Stability AI's long term direction will be, but I hope they do figure out a way to become profitable while creating these free models.
Maybe not everything should be about business.
Agreed but this isn't the same as an open source library; it costs A LOT of money to constantly train these models. That money has to come from somewhere, unfortunately.
Yeah. The amount of compute required is pretty high. I wonder, is there enough distributed compute available to bootstrap a truly open model through a system like SETI@home or Folding@home?
The compute exists, but we'd need some conceptual breakthroughs to make DNN training over high-latency internet links make sense.
Distributing the training data also opens up vectors of attack. Poisoning or biasing the dataset distributed to the compute nodes needs to be guarded against... but I don't think that's actually possible in a distributed model (in principle?). If the compute is happening off-server, then trust is required (which is not {efficiently} enforceable?).
Trust is kind of a solved problem in distributed computing. The different "@home" projects and Bitcoin handle this by requiring multiple validations of a block of work for just this reason.
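Roughly, that redundant-validation idea looks like this (a toy majority vote over hypothetical work units, nothing like Bitcoin's actual consensus mechanism):

    from collections import Counter

    def validate_work_unit(results):
        # `results` maps worker id -> the result that worker reported for the
        # same work unit. Accept the unit only when a majority of independent
        # replicas agree, which is roughly how the @home projects guard
        # against bad or malicious contributors.
        counts = Counter(results.values())
        best_result, votes = counts.most_common(1)[0]
        if votes > len(results) // 2:
            return best_result
        return None  # no consensus: re-issue the work unit to new workers

    # Three workers compute the same unit; one returns a poisoned result.
    print(validate_work_unit({"w1": "0xabc", "w2": "0xabc", "w3": "0xdef"}))  # -> "0xabc"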
Forward-Forward looked promising, but then Hinton got the AI-doomer heebie-jeebies and bailed. Perhaps someone will pick up the concept and run with it - I'd love to myself, but I don't have the skillz to build stuff at that depth, yet.
I agree, but Y-Combinator literally only exists to squeeze the most bizness out of young smart people. That's why you're not seeing so much agreement.
YC started out with the intent to give young smart people a shot at starting a business. IMHO it has shifted significantly over the years to more what you say. We see ads now seeking a "founding engineer" for YC startups, but it used to be the founders were engineers.
The choice facing many companies that insist on remaining "open" is:
Do you want to:
1. be right, or
2. stay in business?
This is one of the reasons why OpenAI pivoted to being closed. Not because of greedy value extractors, but because it was the only way to survive.
That was basically why OpenAI was founded.
Too bad they decided to get greedy :-(
Great, but aren't they simultaneously losing money and getting sued?
Maybe. Paychecks help with not being hungry, though.
I'd be happy if my government or the EU or whatever offered cash grants for open research and open weights in the AI space.
The problem is, everyone wants to be a billionaire over there and it’s getting crowded.
Maybe, but in image generation it's also hard to be closed.
The big providers are all so terrified they'll produce a deepfake image of Obama getting arrested or something that their models are locked down to the point where they only seem capable of producing stock photos.
I think the content they are worried about is far darker than an attempt to embarrass a former president.
The internet wouldn't have become as big as it is if it weren't for its open source model.
The internet has been taken over by capitalists, and they have ruined it, in my opinion.
But they used to let you download the model weights to run on your own machine... whereas Stable Diffusion 3 is just in 'limited preview' with no public download links.
One has to wonder, why the delay?
How is a closed beta anything out of the ordinary? They know they would only get tons of shit flung at them if they publicly released something beta-quality, even if clearly labeled as such. SD users can be a VERY entitled bunch.
I've noticed a strange attitude of entitlement that seems to scale with how open a company is - Mistral and Stability AI are on very sensitive ground with the open source community despite being the most open.
If you try to court a community, then it will expect more of you. Same as if you were to claim to be an environmentalist company: you would receive more scrutiny from environmentalists checking your claims.
Moreover, people would start training on the beta model, splitting the ecosystem if it doesn't die entirely. There's nothing good in that timeline.
Uff, that's a good point.
Both SD1.4 and SDXL were in limited preview for a few months before a public release. This has been their normal course of business for about two years now (since founding). They just do this to improve the weights via a beta test with less judgemental users before the official release.
That's nothing new with Stability. Even 1.5 was "released early" by RunwayML because they felt Stability was taking too long to release the weights instead of just providing them in DreamStudio.
Stability will release them in the coming weeks.
Are they still the commercial affiliate of the CompVis group at Ludwig Maximilian University?