
StreamDiffusion: A pipeline-level solution for real-time interactive generation

acheong08
20 replies
17h46m

This feels unreal. It feels like a decade passed within a year.

mattigames
7 replies
13h52m

I can't wait until it can do my job. Then I will just run it on my PC and connect it to Slack, so my employer will receive similar results to when I did it manually, and I will be paid without spending any time actually working. I will be able to focus on my hobbies for once. This is how this will all play out in the end, right?

godelski
2 replies
12h40m

I will be paid without spending any time actually working

This is how this will all play out in the end, right?

Somebody is gonna tell him right? I don't want to be the one to crush such innocence.

mattigames
0 replies
4h37m

It was sarcasm.

ChatGTP
0 replies
3h44m

You should try using AI to gauge sentiment.

sroussey
0 replies
13h37m

Your employer can just replace you then and save the money

poulpy123
0 replies
4h59m

Yep, you will be totally paid, don't worry!

holoduke
0 replies
12h28m

If you are the first, yes, for a time. Make sure you duplicate your work across 500 other jobs and become a millionaire, because it won't last long once everyone finds out.

buryat
0 replies
11h46m

Start recording your thought process in great detail while solving problems, then train a model, sell the model to work as you, and roll in money (likely not, as you would be outcompeted by other models).

knlam
2 replies
14h31m

Now, as a frontend developer, I understand why folks complain that the frontend landscape changes so fast that it is impossible to keep up.

nazka
0 replies
7h23m

At least this is innovation and always moving forward. In the frontend space it's sometimes steps backward, reinventing the wheel, or just chasing a new tool because it looks shinier…

mclightning
0 replies
7h33m

As a frontend developer, you understood how folks complain about the changing frontend landscape through... changes in ML, a completely different landscape from your own?

Are you an LLM?

jimmyl02
2 replies
17h4m

The entire open-source AI space feels like this right now. Basically every day there is some new advancement that makes something previously deemed impossible achievable, and it's actually really hard to keep up with all the changes.

legel
1 replies
16h38m

100% agreed. I've been developing deep neural networks for over 10 years and this is just surreal.

On the bright side, one source of "sanity" that I'm finding is to review a collection of daily "hot" publications in AI/ML curated here: https://huggingface.co/papers

kwerk
0 replies
14h37m

Thanks for sharing.

Roritharr
2 replies
11h6m

This is the only thing that has me wondering if this is what the progress curve feels like in the opening act of the Singularity.

PartiallyTyped
0 replies
7h54m

We are at the lower end of the S-curve.

ChatGTP
0 replies
3h53m

Generating images on high end hardware and then...?

amelius
1 replies
17h25m

This software evolves faster than I can apt-get install it.

throwup238
0 replies
16h42m

Not even ArchLinux’s rolling community repositories can keep up.

I’ve had to git clone everything like a package manager-less serf. Where even is my filesystem?!

gigel82
0 replies
15h35m

Reminds me of an incremental game ( https://www.reddit.com/r/incremental_games/ ) - BTW, don't start playing one of those or you'll ruin your holidays... :)

brcmthrowaway
12 replies
17h50m

What is the fps on Apple Silicon?

washadjeffmad
9 replies
16h30m

0 because there's no MPS support.

However, a Studio with an M1 Max 64GB is ~13x slower at generative AI with SD1.5 and SDXL than an RTX 4090 24GB at the same cost (~$1,800, refurb) right now.

Terretta
8 replies
15h38m

0 because there's no MPS support. ... Studio with an M1 Max 64GB is ~13x slower at generative AI with SD1.5 and SDXL than an RTX 4090 24GB at the same cost (~$1,800, refurb)

Does the 4090 have a computer attached to it? It seems like with no computer, the speed would also be 0.

teaearlgraycold
4 replies
15h3m

If we’re being snarky “Apple Silicon” won’t work without a motherboard and power supply either.

givinguflac
2 replies
12h47m

I get what you’re saying but I don’t think there was snark. Just the fact that a 4090 without a computer attached won’t work. It’s not like you can buy apple silicon without a Mac attached.

jbverschoor
1 replies
4h22m

You can just get a PCIe enclosure and use the hardware. Attaching it to a VM makes sense because of drivers, etc.

washadjeffmad
0 replies
28m

eGPUs don't work with Apple Silicon Macs, only Intel. We ran into a lot of the limitations early on, and this is the only reason we still have 2018 Mac minis and 2019 MacBook Pros.

https://support.apple.com/en-us/102363

Five years, and still no solution. And somehow they're spinning memory bandwidth as some sort of prescient act of Apple genius for AI. It's insulting.

numpad0
0 replies
5h6m

I think the point between the lines in the GP is that people blindly believing Apple's marketing graphs is annoying; Apple Silicon GPU marketing comparisons against NVIDIA GPUs are made using the laptop variants, which at one point were the exact same silicon as the desktop GPUs, just software-limited to fit laptop power/cooling envelopes, but that hasn't been true in the 30/40 series generations.

echelon
2 replies
14h43m

AI is best done in the Linux/Ubuntu/Pytorch/Nvidia ecosystem. Windows has some exposure due to WSL/Nvidia.

Mac is not a great place for AI/ML yet. Both the hardware and the software present challenges. It'll take time.

When I was hacking on AI stuff on a MacBook, I had a second Framework laptop with an eGPU that I SSH'd into.

sroussey
1 replies
13h27m

I think the tensor cores in the 4090 really help, and of course CUDA supporting every piece of hardware they offer (cough cough, ROCm) means that researchers are going to start there.

That said, I think Apple will have some interesting stuff in a year or two (M4 or more likely M5) where they can flex their NPU, Accelerate framework, and unified memory GPU and have it work with more modern requirements.

Time will tell what their software and hardware story is for local inference for generative AI.

Siri (dictation, some assistant stuff, and TTS) runs on device, and I doubt they want to undo that.

I doubt they will do much for training, but maybe a NUMA version of a Mac Pro with several M4 Ultras will prove me wrong?

echelon
0 replies
13h24m

That said, I think Apple will have some interesting stuff in a year or two (M4 or more likely M5) where they can flex their NPU, Accelerate framework, and unified memory GPU and have it work with more modern requirements.

Plus two years for software support by the broader ecosystem.

Even Windows, with CUDA + drivers, gets less support.

yazaddaruvala
0 replies
11h47m

I run DrawThings with SDXL Turbo on my M1 Pro w/ 32GB RAM

I get a 512x512 5 step image generated in 5 seconds. No refiner, upscaler, or face restoration.

My understanding is that DrawThings hasn’t been optimized for SDXL Turbo and/or pipelined generation yet.

For reference: SDXL Base+Refiner with face restoration at 2k x 2k 50 step image generation takes about 120 seconds.

gaogao
0 replies
17h46m

At least 1/8 or so, but yeah, getting it running on Apple at at least 24fps would be huge. Some degree of interpolation might do it. You could maybe get away with 12fps, especially with an anime aesthetic, since that's basically animating on 2's.

smusamashah
4 replies
15h42m

I just tried the realtime-text2img demo (it uses npm for the frontend, which I think is too much for this). Modified it to produce only 1 image instead of 16. Works well on a laptop with an RTX 3080. It's probably 2 images/sec.

EDIT: The `examples\screen` demo almost feels realtime. It says 4 fps on the window, but I don't know what that represents.

EDIT: The denoising strength in img2img is very low though, which means the returned image is only slightly different from the base image.
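For anyone who hasn't played with the knob being described, this is just the standard img2img denoising strength. A minimal sketch in plain diffusers (not StreamDiffusion's own pipeline; the file names and values here are placeholders):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Plain diffusers img2img. "strength" is the denoising strength discussed above:
# low values keep the output close to the input image, high values let the
# model repaint more of it.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="a watercolor painting of a city street",
    image=init_image,
    strength=0.3,            # effective steps ≈ strength * num_inference_steps
    num_inference_steps=20,
).images[0]
result.save("output.png")
```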

godelski
3 replies
12h42m

How's the actual quality, diversity, and alignment, though? I'm away from my GPU for a few days. It's always hard to judge generative papers without getting hands-on, because you write to the reviewers, which means you gotta cherry-pick (I think this is bad, but it's where we're at). They're using a tiny autoencoder? Artspew did that too and was getting higher FPS (using Triton rather than TensorRT), but the quality was garbage (still cool). Regardless, these numbers are impressive even if the quality isn't anywhere near what's shown, but it's hard to tell.

smusamashah
1 replies
9h8m

Not good, which is very understandable. You are not supposed to use the output as a final image. Find a good prompt/seed by iterating quickly, then go for a higher step count to render a higher-quality image.

foxhop
0 replies
5h9m

I use Automatic1111's hires fix with SDXL Turbo on an RTX 4090 and it looks the best by far for high-quality images once you find a good prompt/seed (but here's the thing: turning it on makes all the prompts/seeds that much better...)

numpad0
0 replies
5h13m

If by diversity and alignment you mean not being disciplined into caricaturing and stigmatizing racial stereotypes a la Disney films, that won't be coming from anywhere in Asia... Especially the "Asian" face makeup prevalent in NA, sometimes derogatorily called the "Pocahontas" face. That one is an American special.

modeless
4 replies
17h22m

Does 100fps mean I can provide a new input every 10 ms and get a new output every 10ms? Or do inputs need to be batched together to get that average throughput?

stale2002
3 replies
17h3m

I haven't tried it, but taking an educated guess, I don't think batching should be required.

The slow part for models is loading the model up. But once the model is up, you can send it whatever input you want.

Parsing and sending the image data just doesn't pass my gut check as the likely bottleneck here.

modeless
1 replies
16h56m

That's not the issue; the issue is GPU utilization. Batching enables higher utilization and higher images-per-second throughput, but it doesn't improve latency.

Kubuxu
0 replies
16h15m

In essence, batching allows for more efficient use of memory bandwidth. Without batching, for every generation you need to transfer the whole model from GPU memory to the GPU cores once per image, which sets an upper bound on speed. With batching, the bottlenecks start showing up elsewhere.

For SD1.5, a 4090 is able to do ~17 it/s without batching and ~90-100 it/s with batching.

These numbers might be old at this point, though; I looked at this ~3 months ago.
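To make the throughput-vs-latency distinction concrete, here's a rough sketch with plain diffusers (model, step count, and batch sizes are arbitrary, and this is not StreamDiffusion's stream batching):

```python
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def seconds_per_image(batch_size: int, steps: int = 20) -> float:
    pipe("warmup", num_inference_steps=steps)   # warm up kernels/allocator first
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    pipe("a photo of a cat", num_inference_steps=steps,
         num_images_per_prompt=batch_size)
    torch.cuda.synchronize()                    # wait for all queued GPU work
    return (time.perf_counter() - t0) / batch_size

# Per-image time drops with batching (the weights are reused across the batch
# per memory transfer), but the latency of any single request does not improve.
print("batch=1:", seconds_per_image(1))
print("batch=8:", seconds_per_image(8))
```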

godelski
0 replies
12h3m

Everyone does a warmup before measuring. But measuring isn't always done right, because we should measure GPU time only, and some people naively use CPU time, which is problematic because the process is asynchronous. They have a few timing scripts, though, and I'm away from my GPU. There are some interesting things, but they look like they know how to time. It can also get confusing because of whether batches are being considered or not. Some works batch, some do single images. The only problem is when it isn't communicated correctly or is left ambiguous.
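For anyone unfamiliar with the GPU-vs-CPU timing distinction, a minimal sketch of what timing the GPU side looks like (the workload here is a made-up stand-in, not their pipeline):

```python
import torch

def run_pipeline() -> torch.Tensor:
    # Hypothetical stand-in for the diffusion step being benchmarked.
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    return x @ x

run_pipeline()                              # warmup so compilation/allocation isn't measured

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
run_pipeline()
end.record()
torch.cuda.synchronize()                    # CUDA launches are async; wait before reading
print(f"GPU time: {start.elapsed_time(end):.2f} ms")

# Naive CPU timing (time.perf_counter around the call with no synchronize)
# would return almost immediately and under-report the real cost.
```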

Their paper is ambiguous, unfortunately. The abstract, intro, and conclusion suggest single images by motivating with sequential generation (specifically mentioning the metaverse). The experiment section says:

We note that we evaluate the throughput mainly via the average inference time per image through processing 100 images.

That implies batching, along with their name, Stream Batch...

Looking at the code, I'm a bit confused. I'm away from my GPU so I can't run it. Maybe someone can let me know? This block[0] measures correctly but is using a downloaded image? Then just opens the image in the preprocess? (multi looks identical.) This block[1] is using CPU? But running on CPU. (There's another like this.)

So I'm quite a bit confused tbh.

[0] https://github.com/cumulo-autumn/StreamDiffusion/blob/03e2a7...

[1] https://github.com/cumulo-autumn/StreamDiffusion/blob/03e2a7...

ilaksh
4 replies
18h19m

How does the demo with the girl moving in and out of frame work? Is it ControlNet?

ec109685
1 replies
17h47m

So left is the source image and right is the resultant image?

washadjeffmad
0 replies
17h8m

Yes. Compare to AnimateDiff.

woleium
0 replies
18h3m

It's video input. From TFA:

Stochastic Similarity Filter reduces processing during video input by minimizing conversion operations when there is little change from the previous frame, thereby alleviating GPU processing load, as shown by the red frame in the above GIF
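In case it helps, the idea reads roughly like this (a sketch of the concept only; the actual filter skips probabilistically, and the threshold and similarity metric here are made up):

```python
import torch
import torch.nn.functional as F

class SimilaritySkipFilter:
    """Reuse the last output when the new frame barely differs from the previous one.
    A rough sketch of the idea, not StreamDiffusion's implementation."""

    def __init__(self, threshold: float = 0.98):
        self.threshold = threshold
        self.prev_frame = None
        self.prev_output = None

    def __call__(self, frame: torch.Tensor, generate):
        if self.prev_frame is not None:
            sim = F.cosine_similarity(
                frame.flatten().float(), self.prev_frame.flatten().float(), dim=0
            )
            if sim.item() > self.threshold:
                return self.prev_output        # skip the diffusion pass entirely
        self.prev_frame = frame
        self.prev_output = generate(frame)     # run the real img2img generation
        return self.prev_output
```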

Flux159
0 replies
18h0m

I think it's just img2img with a prompt & RCFG scale and no ControlNet, since there's a GitHub issue about adding ControlNet support open at the moment.

programjames
3 replies
16h23m

This paper is horribly written. It's like the authors are trying to sell me on themselves as researchers, instead of helping me understand their research (y'know, the entire reason journals got started??). An entire section for "stream batching" was just too much, and none of their ideas were innovative or unique. It was incredibly dense simply because it's obfuscated, which makes me believe the authors themselves don't really understand what they're doing.

The results aren't even very good. They claim a 60x speedup, but compared to what? Hugging Face's Diffusers AutoPipeline... from a company notorious for buggy code and inefficient pipelines. And that's for naively running the pipeline on every image. Give me a break.

kristopolous
1 replies
16h0m

Somehow just hacking together code to create something is considered publishable these days. The code works but it really is just pasted together stuff from the last few weeks of research.

godelski
0 replies
10h51m

I don't think I have a problem with this, tbh, though this specifically looks more engineering- and product-oriented. What I do have a problem with is comparing papers across vastly different TRLs, and comparing works done with 100 GPU-years of compute to works done with 1 GPU-year (or less). They're just completely different classes of work, and comparing them devoid of that context is meaningless, you know?

The reason I don't have a problem is that I see papers as how we researchers communicate with other researchers. But I feel that's not how everyone sees them, and there's the aspect that this is how we're judged, so incentives get misaligned with the actual goal. Idk if the reward hacking is ironic or just makes sense, because our job is to optimize. But don't let anyone try to convince you that reward (or any cost function) is enough.

godelski
0 replies
10h57m

instead of helping me understand their research (y'know, the entire reason journals got started??)

ML is crazy right now, and people don't see papers as a means of researchers communicating with other researchers. You write papers to reviewers. But your reviewers are stochastic, so it's hard to write to them, because they may or may not be in your niche.

I'll add, though, that this isn't why journals were created, and that CS/ML doesn't typically use journals ({T,J}MLR, PAMI, and a few others exist, sure) and instead writes to conferences: fixed dates, zero-sum, a 1.5-shot setting (one rebuttal, zero revisions). Journals were created for the dissemination of papers, indirectly about communicating with one another, but you know... now we've got arXiv, and blogs and websites are sometimes way better, just like how papers got better pictures with computer graphics.

timexironman
1 replies
10h17m

Is there a video of it I can view anywhere?

ChatGTP
0 replies
3h46m

Try clicking on the link?

badloginagain
1 replies
17h43m

Yo I just heard about MidJourney this year.

And this appears to be a local runtime stable diffusion streaming library?

Bruh.

Keyframe
0 replies
17h30m

Singularity is real, but it's people. Amazing fast-paced progress.

_joel
1 replies
9h41m

Maybe we're all living in a simulation^H^H^H^H^H pipeline-level solution for real-time interactive generation.

ChatGTP
0 replies
3h52m

Maybe we're not?

kristopolous
0 replies
17h18m

This more or less just worked as documented. Most of these demos tend to blow up and give really wonky deep errors.

Good job. Give it a try. Look into the server.py of realtime-txt2img to change the model if you want to generate something other than anime. Pointing it to, say, https://huggingface.co/runwayml/stable-diffusion-v1-5 works fine.

The results are genuinely fast. Not great, but fast. If you switch to SDXL via LCM-LoRA (https://huggingface.co/latent-consistency) you may get better stuff, but that's when it's going to get difficult and you'll start to run into those mysterious crashes I talked about that require, you know, actual work.

My setup: 4090 / 3990X / CUDA 12.2 / Debian sid. YMMV.
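If you want to see what that SDXL + LCM-LoRA path looks like outside of their server.py, a minimal diffusers sketch (this is the generic recipe, not StreamDiffusion's config; the prompt and step count are arbitrary):

```python
import torch
from diffusers import StableDiffusionXLPipeline, LCMScheduler

# Plain diffusers: SDXL base with the public LCM-LoRA weights.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

# LCM-LoRA is meant for very few steps with guidance effectively disabled.
image = pipe("a cinematic photo of a lighthouse at dusk",
             num_inference_steps=4, guidance_scale=1.0).images[0]
image.save("lcm_lora.png")
```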

Flux159
0 replies
18h37m

arXiv paper here: https://arxiv.org/abs/2312.12491

I think it's possible to get faster than their default timings for a 4090 (I have been able to get 10fps without optimizations with SDXL Turbo and 1 iteration step), but their other improvements, like using a Stochastic Similarity Filter to prevent unnecessary generations, are good for getting fast results without having to pin your GPU at 100% all the time.
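For anyone curious what that un-optimized baseline looks like, a minimal diffusers sketch (roughly the SDXL Turbo model card settings, single step, guidance off; actual fps will vary by hardware):

```python
import torch
from diffusers import AutoPipelineForText2Image

# Plain diffusers SDXL Turbo, i.e. without StreamDiffusion's pipeline-level
# optimizations; the model is distilled for single-step sampling.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

image = pipe("a cat wearing a space suit",
             num_inference_steps=1, guidance_scale=0.0).images[0]
image.save("turbo.png")
```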