
Show HN: Real-time image generation with SDXL Lightning

altryne1
9 replies
19h57m

I used this + Groq yesterday to augment (with a Chrome extension) the infinite fun game from Neal Agarwal, generating actual images and not only emojis.

This feels like the future: near-real-time image and LLM generation, using Mixtral from Groq as my prompt writer and the Fal API for real-time image generation!

https://x.com/altryne/status/1760561501096575401?s=20

lakshyaag
4 replies
19h47m

Woah, that's pretty cool. Wondering if it can be converted into a card-based game

pitherpather
3 replies
19h13m

> a game

Competitive prompting?

Zircom
1 replies
18h43m

I've had an idea for a Cards Against Humanity-style game using image generation instead: there's a central card for the round, you add something from your hand, and then pick from 5-10 generated images to submit.

jcims
0 replies
18h15m

Prompt Fighter!

airstrike
1 replies
17h52m

that was a pretty cool video / demo / idea. good stuff!

bigboy12
0 replies
15h41m

Holy shit!!!

mdrzn
0 replies
11h33m

Damn that looks great! Any chance of sharing the chrome extension?

albert_e
0 replies
12h35m

Idea: convert this into a side scrolling game where the background gradually and seamlessly transitions into a rendering of the words we are dealing with as we progress.

I am imagining the green lush landscape from early parts of the demo to slowly transform into the dry mountainous landscape from later images while new characters appear in the foreground.

(I'd posted this comment incorrectly under the main HN post earlier, instead of as reply here. Too late to delete it apparently.)

01HNNWZ0MV43FF
3 replies
16h9m

What's the RAM and speed like for local inference?

r-k-jo
2 replies
14h48m

It's using ~15GB VRAM

lonk
1 replies
8h40m

What is the speed on CPU + 16 GB RAM, without a GPU?

whywhywhywhy
0 replies
3h1m

SDXL normally takes 40-60+ minutes per image on CPU, so considering this is 1-4 steps instead of 20-25 steps, you can make a guess.
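
Rough math, if you assume time per image scales roughly linearly with step count (ignoring the fixed text-encoder/VAE cost):

    full_steps, full_minutes = 25, 50   # a typical SDXL run, using the 40-60+ min CPU figure above
    lightning_steps = 4
    print(full_minutes * lightning_steps / full_steps)   # ~8 minutes per image on CPU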

treesciencebot
2 replies
20h37m

Yep, this is using SDXL Lightning underneath, which was trained by ByteDance on top of Stable Diffusion XL and released as an open-source model. In addition to that, it is using our inference engine and real-time infrastructure to provide a smooth experience compared to other UIs out there (which, as far as I know, are not even comparable speed-wise: ~370ms for 4 steps here vs ~2-3 seconds in the Replicate link you posted).
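
If you want to poke at the raw checkpoint locally, it drops into a stock diffusers SDXL pipeline. Roughly something like this (the weight file name is from memory; the ByteDance model card has the authoritative snippet):

    import torch
    from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, EulerDiscreteScheduler
    from huggingface_hub import hf_hub_download
    from safetensors.torch import load_file

    base = "stabilityai/stable-diffusion-xl-base-1.0"
    repo = "ByteDance/SDXL-Lightning"
    ckpt = "sdxl_lightning_4step_unet.safetensors"  # 4-step distilled UNet (name from memory)

    # swap the base SDXL UNet for the Lightning-distilled one
    unet = UNet2DConditionModel.from_config(base, subfolder="unet").to("cuda", torch.float16)
    unet.load_state_dict(load_file(hf_hub_download(repo, ckpt), device="cuda"))

    pipe = StableDiffusionXLPipeline.from_pretrained(
        base, unet=unet, torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    # Lightning expects "trailing" timestep spacing and no CFG
    pipe.scheduler = EulerDiscreteScheduler.from_config(
        pipe.scheduler.config, timestep_spacing="trailing"
    )
    pipe("a baby raccoon, cinematic", num_inference_steps=4, guidance_scale=0).images[0].save("out.png")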

smallerfish
1 replies
9h30m

Any plans to make an API? I'm building a website to catalog fairly common objects, and could use images to spice it up. I was looking at pexels...but this is just so much better.

EDIT - ah you have one. You're welcome. Sign up here folks. :)

Couple of questions in that case: a) What is the avg price per 512x512 image? Your pricing is in terms of machine resources, but (for my use case) I want a comparison to pexels. b) What would the equivalent machine setup be to get inference to be as fast as the website demo? c) Is the fast-sdxl api using the exact same stack as the website?

drochetti
0 replies
4h7m

There's no hidden magic in the playground or in the demo app: we use the same API available to all customers, along with the same JS client and best practices available in our docs.

For all your questions, I recommend playing with it in the API playground; you'll be able to test different image sizes and parameters, and get an idea of the cost per inference.

If you have any other questions, say hello on our Discord and I'm happy to help you.

https://fal.ai/models/stable-diffusion-xl-lightning
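
The shape of a call is basically an authenticated POST with a JSON prompt. Something along these lines (the route, auth header, and response fields below are illustrative; the model page above has the exact snippet):

    # illustrative only: check the docs linked above for the exact route and response shape
    import requests

    resp = requests.post(
        "https://fal.run/fal-ai/fast-lightning-sdxl",      # placeholder model route
        headers={"Authorization": "Key YOUR_FAL_KEY"},
        json={"prompt": "a watercolor fox, soft light"},
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json())    # response includes URLs for the generated image(s)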

r-k-jo
0 replies
14h53m

I also made a demo with Gradio, but it's 2x slower than fal.ai! It uses stable-fast compilation and runs on a single A10G.

https://huggingface.co/spaces/radames/Real-Time-Text-to-Imag...

If you have a GPU/CUDA/Docker you can try it locally:

docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all -e SFAST_COMPILE="1" -e USE_TAESD="0" registry.hf.space/radames-real-time-text-to-image-sdxl-lightning:latest python app.py

thyrox
6 replies
14h50m

Wow, this is super impressive, but does somebody know a way to generate consistent characters with Stable Diffusion?

What I mean is: if my first prompt is a girl talking to a cat and my second prompt is the girl playing with that cat, I want the girl and the cat to be the same in both pictures.

Is that possible? If so, any links or tutorials would be super helpful.

ppsreejith
0 replies
13h36m

IIRC Dashtoon studio allows you to create comics with consistent characters using stable diffusion: https://dashtoon.com/create

padolsey
0 replies
9h58m

Check out https://scenario.gg - they let you train your own LoRAs on custom images of a character (you need around 20 or so images from different angles for good consistency). A bit simpler, and actually still pretty decent, is IP-Adapter, which they also support. Having the cat be consistent is going to be challenging without a custom LoRA, I reckon. See this for guidance: https://help.scenario.com/training-a-character-lora
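
For the IP-Adapter route, the diffusers integration is only a few lines. A rough sketch (the weight/subfolder names are from memory, and the reference image path is a placeholder):

    import torch
    from diffusers import StableDiffusionXLPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models",
                         weight_name="ip-adapter_sdxl.bin")
    pipe.set_ip_adapter_scale(0.7)   # how strongly the reference image steers the output

    ref = load_image("my_character.png")   # one clean shot of your character
    pipe("the same girl playing with a grey cat, soft morning light",
         ip_adapter_image=ref, num_inference_steps=30).images[0].save("consistent.png")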

jumboking63
0 replies
8h9m

You can do this on Dashtoon Studio. They let you upload just one image and train a consistent-character LoRA. It's a tool for AI comic creation. Found this video on their YouTube: https://www.youtube.com/watch?v=EEQwEvKQGvE

A LoRA is by far the most versatile option because you can get your character consistently in any pose and from any camera angle. IP-Adapter replicates too many traits from the input image, and you can't choose what not to replicate (like the pose), so getting a character from a portrait input to do anything else can become difficult. For Reactor you need a generated image into which you can swap a face; it works very well for realistic images, but for stylized images the style is not maintained, and the hairstyle won't get copied either. So Dashtoon is the most reliable and easiest thing I've found so far, because collecting 20 images of a new character is hard, and the properties of the images in a LoRA training set really matter (how many close-ups, how many expressions, etc.).

Zetobal
0 replies
8h41m

It's usually enough to just use names: "Maria Smith" will almost always look like the same "Maria Smith" in good SD models.

OKRainbowKid
0 replies
11h38m

Check out IP-Adapter, FaceID, and Reactor.

airstrike
6 replies
18h20m

Absolutely love this. Wish URLs were shareable!

`late 90s movie poster, 24 hour clock movie "2: Electric Boogaloo" dan aykroyd1`

turned out great

drochetti
2 replies
11h47m

We just added sharing. Let me know what you come up with!

airstrike
0 replies
9h23m

Love it! I have to log off, but I should let you know that the generation seems to be different depending on whether you arrow up or arrow down into the seed when the focus is on the seed input (i.e. going up from 5 to 6 gives a different result than going down from 7 to 6).

scottmf
0 replies
14h14m

providing the seed would have allowed it to be shared

refulgentis
5 replies
19h43m

_Really_ impressive demo, but it'd be oh-so-much-more-impressive if it was smooth; right now, e.g., deleting a word or adding a space causes 4 inferences in quick succession, so it feels janky (EDIT: maybe intentional? step by step displayed?)

Btw this is from fal.ai, I first heard of them when they posted a Stable Cascade demo the morning it was released.

They're *really* good, I *highly* recommend them for any inferencing you're doing outside OpenAI. Been in AI for going on 3 years, and on it 24/7 since last year.

Fal is the first service that sweats the details to get it to the point where it runs _this_ fast in practice, not just in papers: e.g. a WebSocket connection, short-lived JWTs to avoid having to go through an edge function to sign each request with an API key, etc.

jameshart
4 replies
17h47m

Good point. If it’s this fast, maybe it should generate intermediate images along a smooth path through the latent space, rather than just jumping right to the target
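
Something like spherically interpolating between two initial noise latents and decoding a frame per step - a rough sketch of the idea (not what the demo does):

    import torch

    def slerp(t, a, b):
        # spherical interpolation between two noise tensors; normalise only to get the angle
        a_n, b_n = a / a.norm(), b / b.norm()
        omega = torch.acos((a_n * b_n).sum().clamp(-1, 1))
        return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

    # two SDXL starting latents (1 x 4 x 128 x 128 at 1024px), e.g. from two seeds
    z0 = torch.randn(1, 4, 128, 128, generator=torch.Generator().manual_seed(5))
    z1 = torch.randn(1, 4, 128, 128, generator=torch.Generator().manual_seed(6))

    for t in torch.linspace(0, 1, 8):
        z = slerp(t, z0, z1)
        # feed each interpolated latent to the pipeline, e.g.
        # pipe(prompt, latents=z, num_inference_steps=4, guidance_scale=0).images[0]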

refulgentis
3 replies
17h36m

It's sort of the inverse if I'm seeing it correctly: adding one character triggers one inference, but you see steps 1, 2, 3, and 4 of the inference

The latent-space stuff became popular as a visual allegory, which accidentally muddled the technical term it originated from. There's nothing visually smooth about it: it's not a linear interpolation in 3D space, it's a chaotic journey through a ~3-billion-dimensional space.

FeepingCreature
1 replies
8h1m

Sure, but it still has to result in a smooth interpolation. If the relation between latent and pixel space isn't continuous you're gonna have problems during learning.

refulgentis
0 replies
1h59m

Might be visually consistent in 3B dimensions, but it is most certainly not visually consistent to human vision.

jameshart
0 replies
17h3m

Well, it ends up being a journey through different images pulled from the same noise, so yes, any smoothness results more from the degree to which the sampling approach produces similar features when pulled towards slightly different target embeddings than from the images intrinsically being 'neighbors'.

These low-step approaches probably preserve a lot less of the 'noise' features in the final image, so latent-space cruising is probably less fun.

padolsey
3 replies
9h47m

This is so quick! The demo being publicly consumable is powerful, though SDXL is not immune to abuse or NSFW (and possibly illegal) generations. I wonder who ends up being held accountable for such generations.

whelp_24
2 replies
9h11m

That's a strange question. Why is nudity something that an image generator shouldn't be able to create? Are genitals a fiction that should never be present in human art?

sebzim4500
0 replies
9h8m

Probably the fear is about deepfakes rather than generic porn.

padolsey
0 replies
9h7m

Oh indeed. I believe these models should be uncensored. But as we’ve seen with the latest SD model, and with the much more locked down ‘LLM-fronted’ image generators from OpenAI and Google, safety is a massive concern and so they’ve been ridiculously cautious. Not only with the outputs but also with the training material. (‘Porn-In-Porn-Out’)

Regardless of how we feel, lawyers and regulators wait at the door. We should expect new legal precedent within the next year re the generation of copyright-infringing, deepfake, and pornographic material.

treesciencebot
0 replies
18h18m

Spatial prompt adherence is a general missing piece in SDXL (or previous versions of SD). Hoping that SD will get it into as good a shape as your examples!

Tested the example on Stable Cascade as well (the latest open-weight Stability model), and yeah, even that is not great at it: https://fal.ai/models/stable-cascade?share=eab44060-690b-497....

nomel
0 replies
14h17m

Cycling through different seeds gives very different results.

Glyptodon
3 replies
18h19m

Speed is impressive, but it doesn't seem to know what a Pappenheimer rapier is. Keep getting a guy with a weird bladed staff.

jachee
2 replies
15h38m

I'm a 46-year-old former LARPer, D&D nerd, etc., and I don't know what a Pappenheimer rapier looks like… it might be a bit niche. :)

_ache_
0 replies
15h21m

My test is an axolotl. It doesn't seem to have seen enough of them in the training set. The color is right (for a young one). Doesn't sound too niche.

ultimoo
2 replies
15h16m

inferencing images at the speed of typing is such a fantastic way to show off what it's capable of. kudos!

lelag
1 replies
12h19m

And also what it is not capable of doing...

Simply try to get it to output an image with a female that's not a beauty queen. Even when specifically prompted to produce an image of ugly people, it can only generate beautiful people.

tdudhhu
2 replies
11h23m

Wow this is fast!

The results also look great, but the more I see AI-generated images, the more I believe it is not going to eat jobs.

Almost none of the results are production-ready. They all contain strange parts. And maybe more important: the look and feel is always the same.

It is amazing this works as fast as it does, but I think AI is still in its 'hype' stage.

jjbinx007
0 replies
11h17m

Midjourney is more advanced and some of the output is definitely job-stealingly good.

It can do excellent photoreal, sketch, pixel art, diagrams, painting, digital painting, 3D, all to a high standard.

I think of current AI/ML stuff right now as a very fast intern assistant that will get you 85% of the way there but needs supervision.

But the technology is still so new it will only get better.

ben_w
0 replies
9h54m

> the look and feel is always the same.

This is an important thing, but not a consistent thing between humans. For example, while I can tell half of these are AI (and not just because I typed in the prompt), they have very different looks and feels to me:

https://fastsdxl.ai/share/6djh0dlat0s6 "Will Smith facing a white plastic robot, close up, side view, renaissance masterpiece oil painting by da Vinci"

https://fastsdxl.ai/share/ctwqegl5i3xq "a hand stitched embroidery of a cute tiger-racoon playing in woodland"

https://fastsdxl.ai/share/mkfrx33xc4ee "a selfie shot of a furry in a furry convention"

https://fastsdxl.ai/share/mphyrzzjsces "Simple sketch of Mordor, Mount Doom, dark and moody, despair, dense fog"

https://fastsdxl.ai/share/hgjwx6avyx0h "coffee mug stain on paper"

But there are many others like yourself who apparently have higher standards than I do.

(And there are also many who have lower standards than me, who were happy to print huge posters where the left and right eyes of the subject didn't match.)

doodlebugging
2 replies
19h39m

"A cinematic shot of a coelacanth armed with a yellow corinthian leather hand cannon"

Needs work.

barnabyjones
1 replies
18h57m

The neat thing about this speed, though, is that you can flip through the seeds quickly. Seed 626925 is giving me a fish holding some kind of gun, with what I guess are leather gloves. This has always been the main problem with SD imo: it can't really parse sentence structure, so adjective descriptions often don't affect the thing you want.
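
If you're running it locally with diffusers, the same trick is just looping over fixed seeds - a sketch with plain SDXL (not the Lightning setup the demo uses):

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    prompt = ("A cinematic shot of a coelacanth armed with a "
              "yellow corinthian leather hand cannon")
    # a fixed seed makes each image reproducible, so you can step through and cherry-pick
    for seed in range(626920, 626930):
        g = torch.Generator(device="cuda").manual_seed(seed)
        pipe(prompt, generator=g).images[0].save(f"coelacanth_{seed}.png")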

doodlebugging
0 replies
18h28m

Yeah, I like the speed of the renders. It feels relatively smooth.

OP's post to me feels like a marketing post where the output image is a really close representation of the product they hope to sell. We always called these types of things "carefully selected, random examples", in short they are cherry-picked for their adherence to a standard.

In that same vein mine is also a carefully selected, random example of the output you get when the algorithms don't work well, therefore the "Needs work" qualification.

Both are useful since you need to understand the limitations of the tools that you are employing. In my case I stepped thru animals until I found one that it could not render accurately. It did know that a coelacanth is a fish but it couldn't produce an accurate image of one. Then I added modifiers that it could not place in context.

It's a bit like searching the debris field of a tornado for perfectly rounded debris particles and holding those up as a typical result without mentioning that you end up having to ignore all the splintery debris scattered from hell to breakfast around it.

actionfromafar
2 replies
20h40m

How ... can it be so fast? And what is the "blob:https://blahblah" image?

By the way, the raccoon is very prone to getting two tails if you change the prompt a little. :)

supermatt
1 replies
11h24m

What is the difference between SDXL Turbo (released last November) and Lightning (released 2 days ago)? I haven't seen any discussion of Lightning on here the last few days, and HN search only shows a few posts with no comments.

supermatt
0 replies
11h20m

OK - I found the Lightning paper and it seems the difference (to an end user) is that Turbo is max 512px (with no LoRA support) while this is 1024px (with LoRA support). The example images in the paper are also subjectively better in quality and composition - but appear more stylised to my eye.

link: https://arxiv.org/html/2402.13929v1

pugworthy
1 replies
20h11m

It's curious what it does with single letters. For me it seems to often settle on a small, rather detailed building. The more I repeat the letter (e.g., "111" vs "11111111"), the odder the building gets. Which, I can see now, seems pretty sensitive to the seed.

samus
0 replies
13h35m

A word or a concept that is unknown has simply no impact on the output. Try to replace "baby raccoon" with "maxolhx" in the prompt, and it will ignore the word and render an Italian priest instead. Strictly speaking it still has an impact, but nothing we could easily describe. You're pretty much just playing with the seed.

jansan
1 replies
9h57m

Very cool. How about putting the images that were created along the way into a carousel that you can swipe through, just like a flipbook?

albumen
0 replies
9h35m

This is a great idea. Being able to scan through previous images would encourage freer concept exploration, since you could then jump back to your favoured branch-off point easily.

a1o
1 replies
18h19m

Yeah, it can't do pixel art or any of the other things I tried. But yay, speed.

treesciencebot
0 replies
18h14m

Pixel art is a particularly hard thing for these models to do, especially without further fine-tunes or LoRAs, but I'm pretty sure you should be able to get that quality with one of nerijs's LoRAs [0]. For now, I'd do some prompt templating and try some variations of this: 'pixel-art picture of a cat. low-res, blocky, pixel art style, 8-bit graphics' (rough sketch below).

[0]: https://huggingface.co/nerijs/pixel-art-xl
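
Roughly, with diffusers (the LoRA weight file name is from memory; check the repo above):

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors")  # [0]

    template = "pixel-art picture of a {}. low-res, blocky, pixel art style, 8-bit graphics"
    pipe(template.format("cat"), num_inference_steps=25).images[0].save("pixel_cat.png")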

victorbjorklund
0 replies
11h23m

Damn that is fast.

timnetworks
0 replies
6h30m

That's very fast... typing in SDXL as the prompt gives a chilling self-portrait though.

tiborsaas
0 replies
6h25m

Insanely cool :)

One bug though: when I increment the seed, it renders two images, and it takes a bit of jumping up and down in seed numbers to get back to the same image.

ritzaco
0 replies
9h36m

This is so impressive. Is it an ad for fal.ai? Or who is paying for it?

rcarmo
0 replies
5h6m

Ooooookaaaaaay.... This is faster than autocomplete. Amazing.

rcarmo
0 replies
3h58m

Note: doesn't seem to work on iPad.

ozfive
0 replies
9h53m

It's so fast I question the quality of all other models that cost money.

monkeydust
0 replies
10h4m

This is nuts, in a good way. Way more fun to have fast inference on images than text.

hoc
0 replies
20h23m

"bug with long antennas and chrome legs"

You only end up there because it's so fast. Nice. Really a new explorative quality.

fareesh
0 replies
8h31m

Nancy Pelosi as a ninja without a mask = Nancy Pelosi as a ninja wearing a mask

diego_sandoval
0 replies
15h13m

This is amazing. How much does it cost per image?

danielecook
0 replies
14h34m

This is incredible. The reduction in latency has a profound effect in terms of the way I interact with this type of tool. The speed benefit is more than just more image generation. The speed here helps me easily keep the same train of thought moving along as I try different things.

bsenftner
0 replies
8h44m

Seems like with this kind of speed, the only reason this is not a video tool is the lack of consistency controls?

_ache_
0 replies
15h27m

Still doesn't pass the Axolotl test. :/ I guess it needs more memory.

SeanAnderson
0 replies
17h25m

wow! this is really fast. this is actually at the speed I want image generation to be at. it makes me want to explore and use the tool. everything else has felt like begrudgingly throwing in a prompt and eventually giving up because the iteration cycle is too slow.

of course the quality of what is being generated is not competitive with SOTA, but this is going in a really good direction!

Reubend
0 replies
20h27m

I love this demo. It's easily accessible, fast, and intuitive. It's stunning that we can get this level of quality so easily.