More and more companies that were once devoted to being 'open' are now becoming increasingly closed. I appreciate that Stability AI releases these research papers.
It's impressive that it spells words correctly and lays them out, but the issue I have is that the text always has this distinctively over-fried look to it. The color of the text is always ramped up to a single value, which, when placed into a high-fidelity image, gives the impression of text slapped on top in Photoshop afterwards in quite an amateurish fashion, rather than text properly integrated into the image.
I'm expecting at some point the stable diffusion community of developers to recognize the value of Layered Diffusion, the method of generating elements with transparent backgrounds, and transition to outputs that are layered images one can access and tweak independently. The addition of that would make the hands-on media producers of the world say "okay, now we're talking, finally directly ingestible into our existing production pipelines."
There are already ComfyUI nodes for Layered Diffusion. https://github.com/huchenlei/ComfyUI-layerdiffuse
Everyone I know in the CG industries using SD in any sort of pipeline is using Comfy, because a node-based workflow is what they're used to from tools like Substance Designer, Houdini, Nuke, Blender, etc.
That's my impression as well. I used to work in VFX, my entire career is pretty much 3D something or other, over and over.
It's just the presented examples. See the first preview for examples of properly integrated text https://stability.ai/news/stable-diffusion-3
Especially the side of the bus.
The side of the bus still looks weird, though. Like some lower-resolution layer was transformed onto the side; it's also still too bright.
Still, it's impressive we've come this far, but we're now in the uncanny valley, which is a badge of honour tbh.
But the sample images they show here showcase a good job at blending text with the rest of the image, using the correct art style, composition, shading and perspective. It seems like an improvement, no?
The blending looks better, but the LED sign on the bus for example looks almost like handwritten lettering... the letters are all different heights and widths. Not even close to realistic. There's a lot of nuance that goes into getting these things right. It seems like it'll be stuck in an uncanny valley for a long time.
It's very likely an artifact of CFG (classifier-free guidance); hopefully some day we'll be able to ditch this somewhat dubious trick.
This is also the reason why the generated images have this characteristic high contrast and saturation. Better models usually need to rely less on CFG to generate coherent images because they fit the training distribution better.
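For anyone curious, here's a rough sketch of what CFG does at each sampling step (a minimal PyTorch illustration; the variable names and guidance scale are placeholders, not SD3's actual code):

    import torch

    def cfg_noise_prediction(model, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
        # Run the denoiser twice: once conditioned on the prompt embedding,
        # once on an "empty" unconditional embedding.
        noise_cond = model(x_t, t, cond_emb)
        noise_uncond = model(x_t, t, uncond_emb)
        # Extrapolate away from the unconditional prediction. Higher guidance
        # scales improve prompt adherence but push outputs toward the
        # over-contrasted, over-saturated look mentioned above.
        return noise_uncond + guidance_scale * (noise_cond - noise_uncond)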
Question is, will SD3 be downloadable? I downloaded and ran the early SD locally and it is really great.
Or did we lose Stable Diffusion to SaaS too, like we did with many of the LLMs that started off so promising as far as self-hosting goes?
Sounds like it’ll be downloadable. FTA:
In early, unoptimized inference tests on consumer hardware our largest SD3 model with 8B parameters fits into the 24GB VRAM of a RTX 4090 and takes 34 seconds to generate an image of resolution 1024x1024 when using 50 sampling steps. Additionally, there will be multiple variations of Stable Diffusion 3 during the initial release, ranging from 800m to 8B parameter models to further eliminate hardware barriers.
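If the weights do land on the Hugging Face Hub like previous releases, running it locally would presumably look something like this diffusers sketch (the repo id is a placeholder and the pipeline dispatch is an assumption; nothing here is confirmed):

    import torch
    from diffusers import DiffusionPipeline

    # Hypothetical repo id; whatever Stability actually publishes may differ.
    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-3",   # placeholder name
        torch_dtype=torch.float16,          # fp16 to help the 8B model fit in 24GB VRAM
    )
    pipe.to("cuda")

    image = pipe(
        "a red bus with 'Stable Diffusion 3' written on its side",
        num_inference_steps=50,   # matches the 50-step figure quoted above
        height=1024,
        width=1024,
    ).images[0]
    image.save("sd3_test.png")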
The 800m model is super exciting
It will probably suck. These models aren’t quite good enough for most tasks (other than toy fun exploration). They’re close in the sense that you can get there with a lot of work and dice rolling. But I would be pessimistic about a smaller model actually getting you where you want.
Thanks for pointing that out. Super promising.
Yeah, they will all be downloadable weights. 800m, 2b and 8b currently planned.
It looks like it. I really hope they do. Running SDXL right now is proper fun. I don't even use it for anything specific, just to amuse myself at times :D
It's very exciting to see that image generators are finally figuring out spelling. When DALL-E 3 (?) came out they hyped up spelling capabilities but when I tried it with Bing it was incredibly inconsistent.
I'd love to read a less technical writeup explaining the challenges faced and why it took so long to figure out spelling. Scrolling through the paper is a bit overwhelming and it goes beyond my current understanding of the topic.
Does anyone know if it would be possible to eventually take older generated images with garbled up text + their prompt and have SD3 clean it up or fix the text issues?
I would imagine with an img2img workflow it would be, the same way you can reconstruct a badly rendered face by doing a second pass on the affected region.
It's a surprisingly difficult task with quite a bit of research history.
We looked at different solutions extensively (https://medium.com/towards-data-science/editing-text-in-imag...) and ended up building a tool to solve the problem: https://www.producthunt.com/posts/textify-2
Eventually models will get to the point where they can do this well natively but for now the best we can do is a post-processing step.
Yes, this is possible, we have ComfyUI workflows for this.
The best way to do it right now is controlnets.
I'm not sure about re-doing the text of old images; you could try img2img, but coherence is an issue. More controlnets might help.
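For what it's worth, a minimal img2img second pass with diffusers might look roughly like this (the checkpoint, prompt and strength are illustrative; you'd swap in SD3 once its weights are out):

    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # any checkpoint; SD3 once released
        torch_dtype=torch.float16,
    ).to("cuda")

    old_image = Image.open("garbled_text.png").convert("RGB")

    # Re-run the original prompt over the old image. Lower strength preserves
    # more of the original composition; higher strength gives the model more
    # freedom to redraw (and hopefully fix) the lettering, at the cost of coherence.
    fixed = pipe(
        prompt="a storefront sign that reads 'OPEN 24 HOURS'",
        image=old_image,
        strength=0.5,
        guidance_scale=7.0,
    ).images[0]
    fixed.save("fixed_text.png")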
This is really exciting to see. I applaud Stability AI's commitment to open source and hope they can operate for as long as possible.
There was one thing I was curious about... I skimmed through the executive summary of the paper but couldn't find it. Does Stable Diffusion 3 still use CLIP from OpenAI for tokenization and text embeddings? I would naively assume that they would try to improve on this part of the model's architecture to improve adherence to text and image prompts.
They use three text encoders to encode the caption:
1. CLIP-G/14 (OpenCLIP)
2. CLIP-L/14 (OpenAI)
3. T5-v1.1-XXL (Google)
They randomly disable encoders during training, so that when generating images SD3 can use any subset of the 3 encoders. They find that using T5 XXL is important only when generating images from prompts with "either highly detailed descriptions of a scene or larger amounts of written text".
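A toy sketch of that random encoder drop-out during training (heavily simplified; the drop probability and the way embeddings are combined here are illustrative, not the paper's exact scheme):

    import random
    import torch

    def encode_caption(caption, clip_g, clip_l, t5, p_drop=0.5):
        # Encode the caption with all three text encoders, randomly zeroing
        # each one out during training. Because the model learns to cope with
        # missing encoders, any subset can be used at inference time, e.g.
        # skipping the large T5 when the prompt has no long text to render.
        embeddings = []
        for encoder in (clip_g, clip_l, t5):
            emb = encoder(caption)
            if random.random() < p_drop:
                emb = torch.zeros_like(emb)  # drop this encoder for this sample
            embeddings.append(emb)
        return torch.cat(embeddings, dim=-1)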
One of the diagrams says they're using CLIP-G/14 and CLIP-L/14, which are the names of two OpenCLIP models - meaning they're not using OpenAI's CLIP.
This looks great, very exciting. The paper is not a lot more detailed than the blog. The main thing about the paper is that they have an architecture that can include more expressive text encoders (T5-XXL here), they show this helps with complex scenes, and it seems clear they haven't maxed out this stack in terms of training. So, expect SD3.1 to be better than this, and expect SD4 to be able to work with video through adding even more front-end encoding. Exciting!
Ha! In contrast to Stability AI, OpenAI is the least open AI lab. Even DeepMind publishes more papers.
I wonder if anyone at OpenAI openly says it: "We're in it for the money!"
The recent letter by SamA regarding Elon's trial had as much truth as Putin saying they are invading Ukraine for de-nazification.
It's hard to build a business on "open". I'm not sure what Stability AI's long term direction will be, but I hope they do figure out a way to become profitable while creating these free models.
Maybe not everything should be about business.
Agreed but this isn't the same as an open source library; it costs A LOT of money to constantly train these models. That money has to come from somewhere, unfortunately.
Yeah. The amount of compute required is pretty high. I wonder, is there enough distributed compute available to bootstrap a truly open model through a system like SETI@home or Folding@home?
The compute exists, but we'd need some conceptual breakthroughs to make DNN training over high-latency internet links make sense.
Distributing the training data also opens up vectors of attack. Poisoning or biasing the dataset distributed to the compute nodes needs to be guarded against... but I don't think that's actually possible in a distributed model (in principle?). If the compute is happening off-server, then trust is required (which is not {efficiently} enforceable?).
Trust is kind of a solved problem in distributed computing. The different "@home" projects and Bitcoin handle this by requiring multiple validations of a block of work for just this reason.
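Roughly, that redundant-validation idea looks like this (a toy majority vote over hypothetical work units, nothing like Bitcoin's actual consensus mechanism):

    from collections import Counter

    def validate_work_unit(results):
        # `results` maps worker id -> the result that worker reported for the
        # same work unit. Accept the unit only when a majority of independent
        # replicas agree, which is roughly how the @home projects guard
        # against bad or malicious contributors.
        counts = Counter(results.values())
        best_result, votes = counts.most_common(1)[0]
        if votes > len(results) // 2:
            return best_result
        return None  # no consensus: re-issue the work unit to new workers

    # Three workers compute the same unit; one returns a poisoned result.
    print(validate_work_unit({"w1": "0xabc", "w2": "0xabc", "w3": "0xdef"}))  # -> "0xabc"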
Forward-Forward looked promising, but then Hinton got the AI-doomer heebie-jeebies and bailed. Perhaps someone will pick up the concept and run with it - I'd love to myself, but I don't have the skillz to build stuff at that depth, yet.
I agree, but Y-Combinator literally only exists to squeeze the most bizness out of young smart people. That's why you're not seeing so much agreement.
YC started out with the intent to give young smart people a shot at starting a business. IMHO it has shifted significantly over the years to more what you say. We see ads now seeking a "founding engineer" for YC startups, but it used to be the founders were engineers.
The choice facing many companies that insist on remaining "open" is:
Do you want to:
1. be right, or
2. stay in business?
This is one of the reasons why OpenAI pivoted to being closed. Not because of greedy value extractors, but because it was the only way to survive.
That was basically why OpenAI was founded.
Too bad they decided to get greedy :-(
Great, but aren't they simultaneously losing money and getting sued?
Maybe. Paychecks help with not being hungry, though.
I'd be happy if my government or the EU or whatever offered cash grants for open research and open weights in the AI space.
The problem is, everyone wants to be a billionaire over there and it’s getting crowded.
Maybe, but in image generation it's also hard to be closed.
The big providers are all so terrified they'll produce a deepfake image of Obama getting arrested or something that their models are locked down to the point where they only seem capable of producing stock photos.
I think the content they are worried about is far darker than an attempt to embarrass a former president.
The internet wouldn't have become as big as it is if it weren't for its open source model.
The internet has been taken over by capitalists, and they have ruined it, in my opinion.
But they used to let you download the model weights to run on your own machine... whereas Stable Diffusion 3 is just in 'limited preview' with no public download links.
One has to wonder, why the delay?
How is a closed beta anything out of the ordinary? They know they would only get tons of shit flung at them if they publicly released something beta-quality, even if clearly labeled as such. SD users can be a VERY entitled bunch.
I've noticed a strange attitude of entitlement that seems to scale with how open a company is - Mistral and Stability AI are on very sensitive ground with the open source community despite being the most open.
If you try to court a community, then it will expect more of you. Same as if you were to claim to be an environmentalist company: you would receive more scrutiny from environmentalists checking your claims.
Moreover, people would start training on the beta model, splitting the ecosystem if it doesn't die entirely. There's nothing good in that timeline.
Uff, that's a good point.
Both SD1.4 and SDXL were in limited preview for a few months before a public release. This has been their normal course of business for about two years now (since founding). They just do this to improve the weights via a beta test with less judgemental users before the official release.
That's nothing new with Stability. Even 1.5 was "released early" by RunwayML because they felt Stability was taking too long to release the weights instead of just providing them in DreamStudio.
Stability will release them in the coming weeks.
Are they still the commercial affiliate of the CompVis group at Ludwig Maximilian University?