
Stable Cascade

yogorenapan
41 replies
1d

Very impressive.

From what I understand, Stability AI is currently VC funded. It’s bound to burn through tons of money and it’s not clear whether the business model (if any) is sustainable. Perhaps worthy of government funding.

minimaxir
33 replies
1d

Stability AI has been burning through tons of money for awhile now, which is the reason newer models like Stable Cascade are not commercially-friendly-licensed open source anymore.

The company is spending significant amounts of money to grow its business. At the time of its deal with Intel, Stability was spending roughly $8 million a month on bills and payroll and earning a fraction of that in revenue, two of the people familiar with the matter said.

It made $1.2 million in revenue in August and was on track to make $3 million this month from software and services, according to a post Mostaque wrote on Monday on X, the platform formerly known as Twitter. The post has since been deleted.

https://fortune.com/2023/11/29/stability-ai-sale-intel-ceo-r...

littlestymaar
30 replies
1d

which is the reason newer models like Stable Cascade are not commercially-friendly-licensed open source anymore.

The main reason is probably Midjourney and OpenAI using their tech without any kind of contribution back. AI desperately needs a GPL equivalent…

ipsum2
17 replies
23h56m

It's highly doubtful that Midjourney and OpenAI use Stable Diffusion or other Stability models.

jonplackett
13 replies
23h48m

How do you know though?

minimaxir
12 replies
23h44m

You can't use off-the-shelf models to get the results Midjourney and DALL-E generate, even with strong finetuning.

cthalupa
9 replies
23h25m

I pay for both MJ and DALL-E (though OpenAI mostly gets my money for GPT) and don't find them to produce significantly better images than popular checkpoints on CivitAI. What I do find is that they are significantly easier to work with. (Actually, my experience across hundreds of DALL-E generations is that the quality is quite poor. I'm in several IRC channels where it's the image generator of choice for some IRC bots, and I'm never particularly impressed with the visual quality.)

For MJ in particular, knowing that they at least used to use Stable Diffusion under the hood, it would not surprise me if the majority of the secret sauce is actually a middle layer that processes the prompt and converts it to one that is better for working with SD. Prompting SD to get output at the MJ quality level takes significantly more tokens, lots of refinement, heavy tweaking of negative prompting, etc. Also a stack of embeddings and LoRAs, though I would place those more in the category of finetuning like you had mentioned.

millgrove
4 replies
22h0m

What do you use it for? I haven't found a great use for it myself (outside of generating assets for landing pages / apps, where it's really really good). But I have seen endless subreddits / instagram pages dedicated to various forms of AI content, so it seems lots of people are using it for fun?

cthalupa
3 replies
21h8m

Nothing professional. I run a variety of tabletop RPGs for friends, so I mostly use it for making visual aids there. I've also got a large format printer that I was no longer using for its original purpose, so I bought a few front-loading art frames that I generate art for and rotate through periodically.

I've also used it to generate art for deskmats I got printed at https://specterlabs.co/

For commercial stuff I still pay human artists.

throwanem
2 replies
18h29m

Whose frames do you use? Do you like them? I print my photos to frame and hang, and wouldn't at all mind being able to rotate them more conveniently and inexpensively than dedicating a frame to each allows.

cthalupa
1 replies
14h38m

https://www.spotlightdisplays.com/

I like them quite a bit, and you can get basically any size cut to fit your needs even if they don't directly offer it on the site.

throwanem
0 replies
14h23m

Perfectly suited to go alongside the style of frame I already have lots of, and very reasonably priced off the shelf for the 13x19 my printer tops out at. Thanks so much! It'll be easier to fill that one blank wall now.

soultrees
1 replies
21h46m

What IRC Channels do you frequent?

cthalupa
0 replies
21h12m

Largely some old channels from the 90s/00s that really only exist as vestiges of their former selves - not really related to their original purpose, just rooms for hanging out with friends made there back when they had a point besides being a group chat.

emadm
1 replies
23h7m

If you try DiffusionGPT with regional prompting added and a GAN corrector, you can get a good idea of what is possible: https://diffusiongpt.github.io

euazOn
0 replies
21h10m

That looks very impressive, unless the demo is cherry-picked. It would be great if this could be implemented into a frontend like Fooocus: https://github.com/lllyasviel/Fooocus

yreg
0 replies
21h10m

That's not really true, MJ and DALL-E are just more beginner friendly.

orbital-decay
0 replies
16h24m

Midjourney has absolutely nothing to offer compared to proper finetunes. DALL-E does: it generalizes well (it can make objects interact properly, for example) and has great prompt adherence. But it can also be unpredictable as hell because it rewrites the prompts. DALL-E's quality is meh - it has terrible artifacts on all pixel-sized details, hallucinations on small details, and limited resolution. Controlnets, finetuning/zero-shot reference transfer, and open tooling would have made a beast of a model of it, but they aren't available.

cthalupa
1 replies
23h42m

Midjourney 100% at least used to use Stable Diffusion: https://twitter.com/EMostaque/status/1561917541743841280

I am not sure if that is still the case.

refulgentis
0 replies
23h40m

It trialled it as an explicitly optional model for a moment a couple years ago. (or only a year? time moves so fast. somewhere in v2/v3 timeframe and around when SD came out). I am sure it is no longer the case.

liuliu
0 replies
22h46m

DALL-E shares the same autoencoders as SD v1.x. It is probably similar to how Meta's Emu-class models work though. They tweaked the architecture quite a bit, trained on their own dataset, and reused some components (or in Emu's case, trained all the components from scratch but reused the same arch).

yogorenapan
8 replies
23h57m

AI desperately needs a GPL equivalent

Why not just the GPL then?

loudmax
7 replies
23h20m

The GPL was intended for computer code that gets compiled to a binary form. You can share the binary, but you also have to share the code that the binary is compiled from. Pre-trained model weights might be thought of as analogous to compiled code, and the training data may be analogous to program code, but they're not the same thing.

The model weights are shared openly, but the training data used to create these models isn't. This is at least partly because all these models, including OpenAI's, are trained on copyrighted data, so the copyright status of the models themselves is somewhat murky.

In the future we may see models that are 100% trained in the open, but foundational models are currently very expensive to train from scratch. Either prices would need to come down, or enthusiasts will need some way to share radically distributed GPU resources.

emadm
5 replies
23h9m

Tbh I think these models will largely be trained on synthetic datasets in the future. They are mostly trained on garbage now. We have been doing opt-outs on these, and it has been interesting to see the quality differential (or lack thereof), e.g. removing books3 from StableLM 3B Zephyr: https://stability.wandb.io/stability-llm/stable-lm/reports/S...

sillysaurusx
2 replies
19h41m

I’ve wondered whether books3 makes a difference, and how much. If you ever train a model with a proper books3 ablation I’d be curious to know how it does. Books are an important data source, but if users find the model useful without them then that’s a good datapoint.

emadm
1 replies
18h14m

We did try stableLM 3b4 with books3 and it got worse both in general and on benchmarks.

Just did some pes2o ablations too, which were eh.

sillysaurusx
0 replies
17h42m

What I mean is, it’s important to train a model with and without books3. That’s the only way to know whether it was books3 itself causing the issue, or some artifact of the training process.

One thing that’s hard to measure is the knowledge contained in books3. If someone asks about certain books, it won’t be able to give an answer unless the knowledge is there in some form. I’ve often wondered whether scraping the internet is enough rather than training on books directly.

But be careful about relying too much on evals. Ultimately the only benchmark that matters is whether users find the model useful. The clearest test of this would be to train two models side by side, with and without books3, and then ask some people which they prefer.

It’s really tricky to get all of this right. But if there’s more details on the pes2o ablations I’d be curious to see.

keenmaster
1 replies
20h41m

Why aren’t the big models trained on synthetic datasets now? What’s the bottleneck? And how do you avoid amplifying the weaknesses of LLMs when you train on LLM output vs. novel material from the comparatively very intelligent members of the human species. Would be interesting to see your take on this.

emadm
0 replies
14h20m

We are starting to see that; see Phi-2 for example.

There are approaches to get the right type of augmented and generated data to feed these models; check out the QDAIF paper we worked on, for example:

https://arxiv.org/pdf/2310.13032.pdf

protomikron
0 replies
21h24m

What about CC licenses for model weights? They're common for media files (images, video, audio, ...), so maybe they'd be appropriate.

thatguysaguy
0 replies
20h42m

The net flow of knowledge about text-to-image generation from OpenAI has definitely been outward. The early open source methods used CLIP, which OpenAI came up with. DALL-E (1) was also the first demonstration that we could do text-to-image at all. (There were some papers years earlier that could give you a red splotch if you said "stop sign" or something.)

programjames
0 replies
23h4m

I think it'd be interesting to have a non-profit "model sharing" platform, where people can buy/sell compute. When you run someone's model, they get royalties on the compute you buy.

minimaxir
0 replies
23h54m

More specifically, it's so Stability AI can theoretically make a business on selling commercial access to those models through a membership: https://stability.ai/news/introducing-stability-ai-membershi...

loudmax
1 replies
23h6m

I get the impression that a lot of open source adjacent AI companies, including Stability AI, are in the "???" phase of execution, hoping the "Profit" phase comes next.

Given how much VC money is chasing the AI space, this isn't necessarily a bad plan. Give stuff away for free while developing deep expertise, then either figure out something to sell, or pivot to proprietary, or get acquihired by a tech giant.

minimaxir
0 replies
22h59m

That is indeed the case, hence the more recent pushes toward building moats by every AI company.

sveme
1 replies
23h41m

None of the researchers are listed as associated with stability.ai, but rather with universities in Germany and Canada. How does this work? Is this exclusive work for stability.ai?

emadm
0 replies
23h18m

Dom and Pablo both work for Stability AI (Dom finishing his degree).

All the original Stable Diffusion researchers (Robin Rombach, Patrick Esser, Dominik Lorenz, Andreas Blattman) also work for Stability AI.

seydor
1 replies
23h50m

exactly my thought. stability should be receiving research grants

emadm
0 replies
23h10m

We should, we haven't yet...

Instead we've given 10m+ supercomputer hours in grants to all sorts of projects. Now we have our grant team in place, and there is a huge increase in available funding for folk that can actually build stuff, which we can tap into.

diggan
1 replies
23h39m

I've seen Emad (Stability AI founder) commenting here on HN somewhere about this before, what exactly their business model is/will be, and similar thoughts.

HN search doesn't seem to agree with me today though, and I cannot find the specific comment/s I have in mind; maybe someone else will have better luck? This is their user: https://news.ycombinator.com/user?id=emadm

emadm
0 replies
23h12m

https://x.com/EMostaque/status/1649152422634221593?s=20

We now have top models of every type, sites like www.stableaudio.com, memberships, custom model deals etc so lots of demand

We're the only AI company that can make a model of any type for anyone from scratch & are the most liked / one of the most downloaded on HuggingFace (https://x.com/Jarvis_Data/status/1730394474285572148?s=20, https://x.com/EMostaque/status/1727055672057962634?s=20)

It's going ok, the team is working hard and shipping good models, and they are accelerating their work on building ComfyUI to bring it all together.

My favourite recent model was CheXagent, I think medical models should be open & will really save lives: https://x.com/Kseniase_/status/1754575702824038717?s=20

downrightmike
0 replies
23h43m

Finally a good use to burn VC money!

obviyus
35 replies
1d

Been using it for a couple of hours and it seems much better at following the prompt. Right away the quality seems worse compared to some SDXL models, but I'll reserve judgement until after a couple more days of testing.

It’s fast too! I would reckon about 2-3x faster than non-turbo SDXL.

kimoz
14 replies
1d

Can one run it on CPU?

rwmj
11 replies
1d

Stable Diffusion on a 16 core AMD CPU takes for me about 2-3 hours to generate an image, just to give you a rough idea of the performance. (On the same AMD's iGPU it takes 2 minutes or so).

OJFord
4 replies
1d

Even older GPUs are worth using then I take it?

For example I pulled a (2GB I think, 4 tops) 6870 out of my desktop because it's a beast (in physical size, and power consumption) and I wasn't using it for gaming or anything, figured I'd be fine just with the Intel integrated graphics. But if I wanted to play around with some models locally, it'd be worth putting it back & figuring out how to use it as a secondary card?

rwmj
1 replies
23h52m

One counterintuitive advantage of the integrated GPU is it has access to system RAM (instead of using a dedicated and fixed amount of VRAM). That means I'm able to give the iGPU 16 GB of RAM. For me SD takes 8-9 GB of RAM when running. The system RAM is slower than VRAM which is the trade-off here.

OJFord
0 replies
23h41m

Yeah I did wonder about that as I typed, which is why I mentioned the low amount (by modern standards anyway) on the card. OK, thanks!

purpleflame1257
0 replies
21h48m

2GB is really low. I've been able to use A1111 Stable Diffusion on my old gaming laptop's 1060 (6GB VRAM) and it takes a little bit less than a minute to generate an image. You would probably need to try the --lowvram flag on startup.

mat0
0 replies
23h46m

No, I don't think so. I think you would need more VRAM to start with.

smoldesu
1 replies
23h49m

SDXL Turbo is much better, albeit kinda fuzzy and distorted. I was able to get decent single-sample response times (~80-100s) from my 4 core ARM Ampere instance, good enough for a Discord bot with friends.

emadm
0 replies
21h43m

SD Turbo runs nicely on an M2 MacBook Air (as does Stable LM 2!)

Much faster models will come

antman
1 replies
9h29m

Which AMD CPU/iGPU are these timings for?

rwmj
0 replies
7h48m

AMD Ryzen 9 7950X 16-Core Processor

The iGPU is gfx1036 (RDNA 2).

weebull
0 replies
2h50m

WTF!

On my 5900X, so 12 cores, I was able to get SDXL to around 10-15 minutes. I did do a few things to get to that.

1. I used an AMD Zen optimised BLAS library. In particular the AMDBLIS one, although it wasn't that different to the Intel MKL one.

2. I preload the jemalloc library to get better aligned memory allocations.

3. I manually set the number of threads to 12.

This is the start of my ComfyUI CPU invocation script.

    export OMP_NUM_THREADS=12
    export LD_PRELOAD=/opt/aocl/4.1.0/aocc/lib_LP64/libblis-mt.so:$LD_PRELOAD
    export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
    export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"

Honestly, 12 threads wasn't much better than 8, and more than 12 was detrimental. I was memory bandwidth limited I think, not compute.

adrian_b
0 replies
23h10m

If that is true, then the CPU variant must be a much worse implementation of the algorithm than the GPU variant, because the true ratio of the GPU and CPU performances is many times less than that.

sebzim4500
0 replies
1d

Not if you want to finish the generation before you have stopped caring about the results.

ghurtado
0 replies
1d

You can run any ML model on CPU. The question is the performance

sorenjan
13 replies
23h40m

How much VRAM does it need? They mention that the largest model uses 1.4 billion parameters more than SDXL, which in turn needs a lot of VRAM.

adventured
8 replies
23h18m

There was a leak from Japan yesterday, prior to this release, and in it 20GB was suggested for the largest model.

This text was part of the Stability Japan leak (the 20GB VRAM reference was dropped in the release today):

"Stages C and B will each be released in two different models. Stage C uses 1B and 3.6B parameters, and Stage B uses 700M and 1.5B parameters. However, if you want to minimize your hardware needs, you can also use the 1B parameter version. For Stage B, both give great results, but the 1.5B version is better at reconstructing finer details. Thanks to Stable Cascade's modular approach, the expected amount of VRAM required for inference can be kept to around 20GB, but it can be reduced even further by using the smaller variants (which, as mentioned earlier, may reduce the final output quality)."

sorenjan
7 replies
23h6m

Thanks. I guess this means that fewer people will be able to use it on their own computer, but the improved efficiency makes it cheaper to run on servers with enough VRAM.

Maybe running stage C first, unloading it from VRAM, and then doing B and A would make it fit in 12 or even 8 GB, but I wonder if the memory transfers would negate any time saving. Might still be worth it if it produces better images though.
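A minimal sketch of that idea, assuming the three stages are exposed as ordinary PyTorch modules (the function, argument names, and call signatures here are hypothetical, not the actual repo API):

    import torch

    def generate(stage_c, stage_b, stage_a, conditioning, device="cuda"):
        stage_c.to(device)
        latent = stage_c(conditioning)                # tiny 24x24x16 latent
        stage_c.to("cpu"); torch.cuda.empty_cache()   # free VRAM before loading stage B

        stage_b.to(device)
        stage_a_latent = stage_b(latent, conditioning)
        stage_b.to("cpu"); torch.cuda.empty_cache()

        stage_a.to(device)                            # small decoder back to pixels
        return stage_a(stage_a_latent)

Whether the two extra transfers eat the savings depends on how long each stage spends on the GPU relative to the copy time.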

whywhywhywhy
2 replies
5h30m

If you're serious about doing image gen locally you should be running a 24GB card anyway because honestly Nvidia's current generation 24GB is the sweet spot price to performance. The 3080's VRAM is laughably the same as the 6-year-old 1080 Ti's, and the 4080's is only slightly more at 16GB while costing about 1.5x a second-hand 3090.

Any speed benefits of the 4080 are gonna be worthless the second it has to cycle a model in and out of RAM vs the 3090 in image gen anyway.

weebull
1 replies
3h3m

because honestly Nvidia's current generation 24GB is the sweet spot price to performance

How is the halo product of a range the "sweet spot"?

I think nVidia are extremely exposed on this front. The RX 7900XTX is also 24GB and under half the price (In UK at least - £800 vs £1,700 for the 4090). It's difficult to get a performance comparison on compute tasks, but I think it's around 70-80% of the 4090 given what I can find. Even a 3090, if you can find one, is £1,500.

The software isn't as stable on AMD hardware, but it does work. I'm running a RX7600 - 8GB myself, and happily doing SDXL. The main problem is that exhausting VRAM causes instability. Exceed it by a lot, and everything is handled fine, but if it's marginal... problems ensue.

The AMD engineers are actively making the experience better, and it may not be long before it's a practical alternative. If/When that happens nVidia will need to slash their prices to sell anything in this sphere, which I can't really see themselves doing.

zargon
0 replies
1h35m

If/When that happens nVidia will need to slash their prices to sell anything in this sphere

It's just as likely that AMD will raise prices to compensate.

adventured
1 replies
22h55m

If it worked I imagine large batching could make it worth the load/unload time cost.

weebull
0 replies
3h1m

There shouldn't be any reason you couldn't do a ton of Stage C work on different images, and then swap in Stage B.

Filligree
1 replies
22h52m

Sequential model offloading isn’t too bad. It adds about a second or less to inference, assuming it still fits in main memory.

sorenjan
0 replies
22h40m

Sometimes I forget how fast modern computers are. PCIe v4 x16 has a transfer speed of 31.5 GB/s, so theoretically it should take less than 100 ms to transfer stages B and A. Maybe it's not so bad after all; it will be interesting to see what happens.
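Rough numbers behind that estimate (parameter counts from this thread; FP16 weights assumed, and the stage A size is just a guess):

    stage_b_bytes = 1.5e9 * 2        # 1.5B params at 2 bytes each ≈ 3 GB
    stage_a_bytes = 20e6 * 2         # stage A is tiny by comparison (rough guess)
    pcie4_x16 = 31.5e9               # bytes per second
    print((stage_b_bytes + stage_a_bytes) / pcie4_x16)  # ≈ 0.096 s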

liuliu
3 replies
22h49m

Should use no more than 6GiB for FP16 models at each stage. The current implementation is not RAM optimized.

sorenjan
1 replies
22h46m

The large C model uses 3.6 billion parameters which is 6.7 GiB if each parameter is 16 bits.

liuliu
0 replies
22h41m

The large C model has a fair number of parameters tied to text-conditioning, not to the main denoising process. Similar to how we split the network for SDXL Base, I am pretty confident we can split off a non-trivial number of parameters into text-conditioning and hence load less than 3.6B parameters during the denoising process.

brucethemoose2
0 replies
18h43m

What's more, they can presumably be swapped in and out like the SDXL base + refiner, right?

vergessenmir
5 replies
21h3m

I'll take prompt adherence over quality any day. The machinery otherwise isn't worth it, i.e. the controlnets, openpose, and depth maps just to force a particular look or to achieve depth. The solution becomes bespoke for each generation.

Had a test of it, and my opinion is that it's an improvement when it comes to following prompts, and I do find the images more visually appealing.

stavros
4 replies
20h38m

Can we use its output as input to SDXL? Presumably it would just fill in the details, and not create whole new images.

RIMR
3 replies
18h54m

I was thinking exactly that. You could use the same trick as the hires-fix for an adherence-fix.

emadm
2 replies
14h24m

Yeah chain it in comfy to a turbo model for detail

dragonwriter
0 replies
11m

For detail, it'd probably be better to use a full model with a small number of steps (something like the KSampler Advanced node with 40 total steps, but starting at step 32-ish). Might even try using the SDXL refiner model for that.

Turbo models are decent at low-iteration decent results, but not so much at adding fine details to a mostly-done image.
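Outside ComfyUI, a rough diffusers-based sketch of the same idea (the model ID is the SDXL refiner mentioned above; strength 0.2 of 40 steps is roughly "start around step 32"; `cascade_image` is assumed to be the Stable Cascade output as a PIL image):

    import torch
    from diffusers import StableDiffusionXLImg2ImgPipeline

    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # Only the last ~20% of the denoising schedule is redone, so the composition
    # from Stable Cascade is kept and just the fine details get reworked.
    detailed = refiner(
        prompt="same prompt as the Cascade generation",
        image=cascade_image,
        strength=0.2,
        num_inference_steps=40,
    ).images[0]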

Filligree
0 replies
4h36m

A turbo model isn't the first thing I'd think of when it comes to finalizing a picture. Have you found one that produces high-quality output?

yogorenapan
11 replies
1d

I see in the commits that the license was changed from MIT to their own custom one: https://github.com/Stability-AI/StableCascade/commit/209a526...

Is it legal to use an older snapshot before the license was changed in accordance with the previous MIT license?

ed
4 replies
23h19m

It seems pretty clear the intent was to use a non-commercial license, so it’s probably something that would go to court, if you really wanted to press the issue.

Generally courts are more holistic and look at intent, and understand that clerical errors happen. One exception to this is if a business claims it relied on the previous license and invested a bunch of resources as a result.

I believe the timing of commits is pretty important— it would be hard to claim your business made a substantial investment on a pre-announcement repo that was only MIT’ed for a few hours.

RIMR
3 replies
18h33m

If I clone/fork that repo before the license change, and start putting any amount of time into developing my own fork in good faith, they shouldn't be allowed to claim a clerical error when they lied to me upon delivery about what I was allowed to do with the code.

Licenses are important. If you are going to expose your code to the world, make sure it has the right license. If you publish your code with the wrong license, you shouldn't be allowed to take it back. Not for an organization of this size that is going to see a new repo cloned thousands of times upon release.

wokwokwok
1 replies
10h46m

No, sadly this won’t fly in court.

For the same reason you cannot publish a private corporate repo with an MIT license and then have other people claim in “good faith” to be using it.

All they need is to assert that the license was published in error, or that the person publishing it did not have the authority to publish it.

You can’t “magically” make a license stick by putting it in a repo, any more than putting a “name here” sticker on someone’s car and then claiming to own it.

The license file in the repo is simply the notice of the license.

It does not indicate a binding legal agreement.

You can of course challenge it in court, and IANAL, but I assure you there is precedent for incorrectly labelled repos removing and changing their licenses.

arcbyte
0 replies
2h16m

It could very well fly. Agency law, promissory estoppel, ...

ed
0 replies
18h22m

There’s no case law here, so if you’re volunteering to find out what a judge thinks we’d surely appreciate it!

treesciencebot
1 replies
1d

I think the model architecture (training code etc.) itself is still under MIT, while the weights (which are the result of training on a huge GPU cluster, as well as of the dataset they used [not sure if they publicly talked about it]) are under this new license.

emadm
0 replies
23h12m

Code is MIT, weights are under the NC license for now.

RIMR
1 replies
18h37m

MIT license is not parasitic like GPL. You can close an MIT licensed codebase, but you cannot retroactively change the license of the old code.

Stability's initial commit had an MIT license, so you can fork that commit and do whatever you want with it. It's MIT licensed.

Now, the tricky part here is that they committed a change to the license that changes it from MIT to proprietary, but they didn't change any code with it. That is definitely invalid, because they cannot license the exact same codebase with two different contradictory licenses. They can only license the changes made to the codebase after the license change. I wouldn't call it "illegal", but it wouldn't stand up in court if they tried to claim that the software is proprietary, because they already distributed it verbatim with an open license.

kruuuder
0 replies
18h21m

they didn't change any code with it. That is definitely invalid, because they cannot license the exact same codebase with two different contradictory licenses.

Why couldn't they? Of course they can. If you are the copyright owner, you can publish/sell your stuff under as many licenses as you like.

weebull
0 replies
2h43m

The code is MIT. The model has a non-commercial license. They are separate pieces of work under different licenses. Stability AI have said that the non-commercial license is because this is a technology preview (like SDXL 0.9 was).

OJFord
0 replies
1d

Yes, you can continue to do what you want with that commit^ in accordance with the MIT licence it was released under. Kind of like if you buy an ebook, and then they publish a second edition but only as a hardback - the first edition ebook is still yours to read.

jedberg
8 replies
1d

I'd say I'm most impressed by the compression. Being able to compress an image 42x is huge for portable devices or bad internet connectivity (or both!).

incrudible
3 replies
1d

That is 42x spatial compression, but it needs 16 channels instead of 3 for RGB.

zamadatix
1 replies
21h44m

Even assuming 32 bit floats (the extra 4 on the end):

4*16*24*24*4 = 147,456

vs (removing the alpha channel as it's unused here)

3*3*1024*1024 = 9,437,184

Or 1/64 raw size, assuming I haven't fucked up the math/understanding somewhere (very possible at the moment).

incrudible
0 replies
1h28m

It is actually just 2/4 bytes x 16 latent channels x 24 x 24, but the comparison to raw data needs to be taken with a grain of salt, as there is quite a bit of hallucination involved in reconstruction.

ansk
0 replies
22h42m

Furthermore, each of those 16 channels would typically be multibyte floats as opposed to single-byte RGB channels. (Speaking generally, I haven't read the paper.)

seanalltogether
2 replies
22h5m

I have to imagine at this point someone is working toward a fast AI-based video codec that comes with a small pretrained model and can operate in a limited-memory environment like a TV, to offer 8K resolution with low bandwidth.

jedberg
0 replies
21h59m

I would be shocked if Netflix was not working on that.

Lord-Jobo
0 replies
17h47m

I am 65% sure this is already extremely similar to LG's upscaling approach in their most recent flagship.

flgstnd
0 replies
23h50m

A 42x compression ratio is also impressive, as it matches the answer to the ultimate question of life, the universe, and everything. Maybe there is some deep universal truth within this model.

k2enemy
7 replies
22h2m

I haven't been following the image generation space since the initial excitement around stable diffusion. Is there an easy to use interface for the new models coming out?

I remember setting up the python env for stable diffusion, but then shortly after there were a host of nice GUIs. Are there some popular GUIs that can be used to try out newer models? Similarly, what's the best GUI for some of the older models? Preferably for macos.

thot_experiment
2 replies
21h57m

Auto1111 and Comfy both get updated pretty quickly to support most of the new models coming out. I expect they'll both support this soon.

stereobit
1 replies
20h51m

Check out invoke.com

sophrocyne
0 replies
19h6m

Thanks for calling us out - I'm one of the maintainers.

Not entirely sure we'll be in the Stable Cascade race quite yet. Since Auto/Comfy aren't really built for businesses, they'll get it incorporated sooner rather than later.

Invoke's main focus is building open-source tools for the pros who use this for work and are getting disrupted, and non-commercial licenses don't really help the ones trying to follow the letter of the license.

Theoretically, since we're just a deployment solution, it might come up with our larger customers who want us to run something they license from Stability, but we've had zero interest in any of the closed-license stuff so far.

brucethemoose2
2 replies
18h38m

Fooocus is the fastest way to try SDXL/SDXL turbo with good quality.

ComfyUI is cool but very DIY. You don't get good results unless you wrap your head around all the augmentations and defaults.

No idea if it will support cascade.

SpliffnCola
1 replies
17h2m

ComfyUI is similar to Houdini in complexity, but immensely powerful. It's a joy to use.

There are also a large amount of resources available for it on YouTube, GitHub (https://github.com/comfyanonymous/ComfyUI_examples), reddit (https://old.reddit.com/r/comfyui), CivitAI, Comfy Workflows (https://comfyworkflows.com/), and OpenArt Flow (https://openart.ai/workflows/).

I still use AUTO1111 (https://github.com/AUTOMATIC1111/stable-diffusion-webui) and the recently released and heavily modified fork of AUTO1111 called Forge (https://github.com/lllyasviel/stable-diffusion-webui-forge).

emadm
0 replies
14h17m

Our team at Stability AI builds ComfyUI, so yeah, it is supported.

yokto
0 replies
19h21m

fal.ai is nice and fast: https://news.ycombinator.com/item?id=39360800 Both in performance and for how quickly they integrate new models apparently: they already support Stable Cascade.

gorkemyurt
7 replies
1d

we have an optimized playground here: https://www.fal.ai/models/stable-cascade

adventured
6 replies
23h8m

"sign in to run"

That's a marketing opportunity being missed, especially given how crowded the space is now. The HN crowd is more likely to run it themselves when presented with signing up just to test out a single generation.

treesciencebot
3 replies
22h46m

Uh, thanks for noticing it! We generally turn it off for popular models so people can see the underlying inference speed and the results, but we forgot about it for this one. It should now be auth-less with a stricter rate limit, just like other popular models in the gallery.

RIMR
1 replies
18h44m

I just got rate-limited on my first generation. The message is "You have exceeded the request limit per minute". This was after showing me cli output suggesting that my image was being generated.

I guess my zero attempts per minute was too much. You really shouldn't post your product on HN if you aren't prepared for it to work. Reputations are hard to earn, and you're losing people's interest by directing them to a broken product.

getcrunk
0 replies
9h23m

Are you using a vpn or at a large campus or office?

archerx
0 replies
5h49m

I wanted to use your service for a project, but you can only sign in through GitHub. I emailed your support about this and never got an answer; in the end I installed SD Turbo locally. I think that GitHub-only auth is losing you potential customers like myself.

MattRix
1 replies
22h46m

It uses github auth, it’s not some complex process. I can see why they would need to require accounts so it’s harder to abuse it.

arcanemachiner
0 replies
10h37m

After all the bellyaching from the HN crowd when PyPI started requiring 2FA, nothing surprises me anymore.

pxoe
5 replies
16h53m

The way the Image Reconstruction section is written, as if this were just an image compression thing, is kind of interesting. The presented use there is very much about storing images and reconstructing them, while "it doesn't actually store original images" and "it can't actually give out original images" are points that get used so often in arguments as a defense of image generators. So it is just a multi-image compression file format, just a very efficient one. Sure, it's "redrawing"/"rendering" its output and makes things look kind of fuzzy, but any other lossy image format does that as well. What was all that "well, it doesn't do those things" nonsense about then? Clearly it can do that.

wongarsu
2 replies
16h44m

In a way it's just an algorithm than can compress either text or an image. The neat trick is that if you compress the text "brown bear hitting Vladimir Putin" and then decompress it as an image, you get an image of a bear hitting Vladimir Putin.

This principle is the idea behind all Stable Diffusion models, this one "just" achieved a much better compression ratio

pxoe
1 replies
16h37m

Well, yeah. But it's not so much about what it actually does as how they talk about it. Maybe (probably) I missed them describing something like this before, but here it's an open admission, demonstrated. I guess they're getting more brazen, given that they're not really getting punished for what they're doing, be it piracy or infringement or whatever.

Filligree
0 replies
4h30m

The model works on compressed data. That's all it is. Sure, it could output a picture from its training set on decompression, but only if you feed that same picture into the compressor.

In which case what are you doing, exactly? Normally you feed it a text prompt instead, which won't compress to the same thing.

gmerc
0 replies
16h24m

Ultimately this is abstraction not compression.

GaggiX
0 replies
8h22m

What was all that "well, it doesn't do those things" nonsense about then? Clearly it can do that.

There is a model that is trained to compress (very lossily) and decompress the latent, but it's not the main generative model, and of course it doesn't store images in it: you give the encoder an image, it encodes it, and then you can decode it with the decoder and get back a very similar image. This encoder/decoder pair is used during training so that stage C can work on a compressed latent instead of directly at the pixel level, which would be expensive. But the main generative model (stage C) should be able to generate any of the images present in the dataset, or it fails to do its job. Stages C, B, and A do not store any images.

The B and A stages work like an advanced image decoder, so unless you have something wrong with image decoders in general, I don't see how this could be a problem (a JPEG decoder doesn't store images either, of course).

joshelgar
5 replies
22h56m

Why are they benchmarking it with 20+10 steps vs. 50 steps for the other models?

liuliu
2 replies
22h50m

Prior generations usually take fewer steps than vanilla SDXL to reach the same quality.

But yeah, the inference speed improvement is mediocre (until I take a look at exactly what computation is performed and have a more informed opinion on whether it is an implementation issue or a model issue).

The prompt alignment should be better though. It looks like the model has more parameters devoted to working with the text conditioning.

treesciencebot
1 replies
22h38m

In my observation, it yields amazing perf at higher batch sizes (4, or better, 8). I assume it is due to memory bandwidth and the constrained latent space helping.

Filligree
0 replies
4h25m

However, the outputs are so similar that I barely feel a need for more than 1. 2 is plenty.

weebull
0 replies
2h39m

...because they feel that at 20+10 steps it achieves superior output to SDXL at 50 steps. They also benchmark it against 1 step for SDXL-Turbo.

GaggiX
0 replies
21h21m

I think that this model used a consistency loss during training so that it can yield better results with fewer steps.

hncomb
4 replies
20h33m

Is there any way this can be used to generate multiple images of the same model? e.g. a car model rotated around (but all images are of the same generated car)

refulgentis
1 replies
19h13m

Yes, input image => embedding => N images, and if you're thinking 3D perspectives for rendering, you'd ControlNet the N.

ref.: "The model can also understand image embeddings, which makes it possible to generate variations of a given image (left). There was no prompt given here."

taejavu
0 replies
14h29m

The model looks different in each of those variations, though, which seems to be intentional. But the post you're responding to is asking whether it's possible to keep the model exactly the same in each render, varying only the perspective.

matroid
1 replies
19h36m

Someone with resources will have to train Zero123 [1] with this backbone.

[1] https://zero123.cs.columbia.edu/

emadm
0 replies
14h18m
skybrian
2 replies
14h15m

Like every other image generator I've tried, it can't do a piano keyboard [1]. I expect that some different approach is needed to be able to count the groups of black keys.

[1] https://fal.ai/models/stable-cascade?share=13d35b76-d32f-45c...

GaggiX
0 replies
8h34m

As with human hands, coherency is fixed by scaling the model and the training.

Agraillo
0 replies
12h29m

I think it's more than this. In my case, most of the images I made about basketball had more than one ball. I'm not an expert, but some fundamental constraints of human (cultural) life (like all piano keys being the same, or there being only one ball in a game) are not grasped by the training, or only partially grasped.

ionwake
2 replies
17h43m

Does anyone have a link to a demo online?

martin82
1 replies
15h50m
ionwake
0 replies
3h59m

Thank you. Is there a demo of the "image to image" ability? It doesn't seem to be in any of the demos I see.

xkgt
1 replies
6h32m

This model is built upon the Würstchen architecture. Here is a very good explanation of how this model works by one of its authors.

https://www.youtube.com/watch?v=ogJsCPqgFMk

lordswork
0 replies
3h56m

Great video! And here's a summary of the video :)

    Gemini Advanced> Summarize this video: https://www.youtube.com/watch?v=ogJsCPqgFMk
This video is about a new method for training text-to-image diffusion models called Würstchen. The method is significantly more efficient than previous methods, such as Stable Diffusion 1.4, and can achieve similar results with 16 times less training time and compute.

The key to Würstchen's efficiency is its use of a two-stage compression process. The first stage uses a VQ-VAE to compress images into a latent space that is 4 times smaller than the latent space used by Stable Diffusion. The second stage uses a diffusion model to further compress the latent space by another factor of 10. This results in a total compression ratio of 40, which is significantly higher than the compression ratio of 8 used by Stable Diffusion.

The compressed latent space allows the text-to-image diffusion model in Würstchen to be much smaller and faster to train than the model in Stable Diffusion. This makes it possible to train Würstchen on a single GPU in just 24,000 GPU hours, while Stable Diffusion 1.4 requires 150,000 GPU hours.

Despite its efficiency, Würstchen is able to generate images that are of comparable quality to those generated by Stable Diffusion. In some cases, Würstchen can even generate images that are of higher quality, such as images with higher resolutions or images that contain more detail.

Overall, Würstchen is a significant advance in the field of text-to-image generation. It makes it possible to train text-to-image models that are more efficient and affordable than ever before. This could lead to a wider range of applications for text-to-image generation, such as creating images for marketing materials, generating illustrations for books, or even creating personalized avatars.

lqcfcjx
1 replies
14h23m

I'm very impressed by the recent AI progress on making models smaller and more efficient. I just have the feeling that every week there's something big in this space (like what we saw previously from ollama, llava, mixtral...). Apparently the space of on-device models is not fully explored yet. Very excited to see future products in that direction.

dragonwriter
0 replies
13h58m

I'm very impressed by the recent AI progress on making models smaller and more efficient.

That's an odd comment to place in a thread about an image generation model that is bigger than SDXL. Yes, it works in a smaller latent space, and yes, it's faster in the hardware configuration they've used, but it's not smaller.

instagraham
1 replies
10h43m

Will this work on AMD? Found no mention of support. Kinda an important feature for such a project, as AMD users running Stable Diffusion will be suffering diminished performance.

drclegg
0 replies
3h48m
gajnadsgjoas
1 replies
22h21m

Where can I run it if I don't have a GPU? Colab didn't work

detolly
0 replies
22h7m

runpod, kaggle, lambda labs, or pretty much any other server provider that gives you one or more gpus.

cybereporter
1 replies
21h58m

Will this get integrated into Stable Diffusion Web UI?

ttul
0 replies
14h33m

Surely within days. ComfyUI's maintainer said he is readying the node for release, perhaps by this weekend. The Stable Cascade model is otherwise known as Würstchen v3 and has been floating around the open source generative image space since fall.

ttpphd
0 replies
1d

That is a very tiny latent space. Wow!

sanroot99
0 replies
13h26m

What are the system requirements needed to run this, particularly how much VRAM would it take?

nialv7
0 replies
6h16m

Can Stable Cascade be used for image compression? 1024x1024 to 24x24 is crazy.

mise_en_place
0 replies
16h23m

Was anyone able to get this running on Colab? I got as far as loading extras in text-to-inference, but it was complaining about a dependency.

holoduke
0 replies
23h49m

Wow, I like the compression part. A fixed 42x compression, that is really nice. Slow to unpack on the fly, but the future is waiting.

SECourses
0 replies
15h20m

It is pretty good. I shared a comparison on Medium:

https://medium.com/@furkangozukara/stable-cascade-prompt-fol...

My Gradio app even works amazingly on an 8 GB GPU with CPU offloading.

GaggiX
0 replies
23h26m

I remember doing some random experiments with these two researchers to find the best way to condition stage B on the latent. My very fancy cross-attn with relative 2D positional embeddings didn't work as well as just concatenating the channels of the input with a nearest-neighbor upsample of the latent, so I just gave up, ahah.
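For what it's worth, a bare-bones sketch of that concatenation conditioning (shapes and channel counts are illustrative, not the actual model's):

    import torch
    import torch.nn.functional as F

    def condition_on_latent(stage_b_input: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        # stage_b_input: (B, C1, H, W); latent: (B, C2, h, w) with h < H, w < W
        upsampled = F.interpolate(latent, size=stage_b_input.shape[-2:], mode="nearest")
        return torch.cat([stage_b_input, upsampled], dim=1)  # (B, C1 + C2, H, W)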

This model used to be known as Würstchen v3.