
Visual Anagrams: Generating optical illusions with diffusion models

minimaxir
11 replies
22h17m

Note that this technique and its results are unrelated to the infamous "spiral" ControlNet images a couple months back: https://arstechnica.com/information-technology/2023/09/dream...

Per the code, the technique is based on DeepFloyd IF, which is not as easy to run as a Stable Diffusion variant.

SamBam
4 replies
21h53m

I missed it, what was infamous about it?

minimaxir
3 replies
21h35m

It created a backlash because a) it was too popular, with AI people hyping "THIS CHANGES EVERYTHING!" and posting low-effort transformations to the point of saturation, and b) non-AI people were "tricked" into thinking it was real art made with a clever technique, since ControlNet is not well known outside the AI sphere, and they got mad.

andybak
1 replies
19h34m

I rather liked it and actually didn't get to see as many examples as I wanted to.

Is there a good repository anywhere or is it just "wade through twitter"?

swyx
0 replies
17h50m

not a repository as such but i linked to some good examples in my sept recap

https://www.latent.space/p/sep-2023

https://github.com/swyxio/ai-notes/blob/main/Monthly%20Notes...

yreg
0 replies
11h53m

It is real art.

tpudlik
1 replies
12h49m

Did you mean to say it's _related_? The original "spiral" image by Ugleh is explicitly credited in the "Related Links" section.

minimaxir
0 replies
12h0m

It’s a similar topic, which is why they credit it, but the mechanism is much different.

ShamelessC
1 replies
21h16m

> Per the code, the technique is based on DeepFloyd IF, which is not as easy to run as a Stable Diffusion variant.

I haven't dug in yet, but it _should_ be possible to use their ideas in other diffusion networks? It may be a non-trivial change to the code provided though. Happy to be corrected of course.

minimaxir
0 replies
21h12m

I suspect the trick only works because DeepFloyd-IF operates in pixel space while other diffusion models operate in the latent space.

> Our method uses DeepFloyd IF, a pixel-based diffusion model. We do not use Stable Diffusion because latent diffusion models cause artifacts in illusions (see our paper for more details).
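
A minimal sketch of the multi-view idea described above (not the authors' code; predict_noise is a hypothetical stand-in for a pixel-space model such as DeepFloyd IF's UNet, and the sampling loop is omitted):

    import torch

    def identity(x):
        return x

    def rot180(x):
        # a 180-degree rotation is a pure permutation of pixels
        return torch.rot90(x, k=2, dims=(-2, -1))

    def predict_noise(x_t, t, prompt):
        # hypothetical stand-in for a pixel-space diffusion model's
        # text-conditioned noise prediction; plug in a real model here
        raise NotImplementedError

    def anagram_noise_estimate(x_t, t, prompts, views, inverse_views):
        # estimate the noise in each transformed view with its own prompt,
        # map each estimate back to the canonical orientation, and average
        estimates = [
            inv(predict_noise(view(x_t), t, prompt))
            for prompt, view, inv in zip(prompts, views, inverse_views)
        ]
        return torch.stack(estimates).mean(dim=0)

In a standard DDPM/DDIM sampling loop, anagram_noise_estimate(x_t, t, prompts=["an oil painting of a duck", "an oil painting of a rabbit"], views=[identity, rot180], inverse_views=[identity, rot180]) would replace the usual single per-step noise prediction.
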
Der_Einzige
1 replies
21h50m

I always thought it was weird that this idea took off with that particular ControlNet model. Many other ControlNet models, when combined with those same images, produce excellent and striking results.

The ecosystem around Stable Diffusion in general is so massive.

minimaxir
0 replies
21h35m

Other ControlNet adapters either don't preserve the high-level shape enough or preserve it too well, IMO. Canny/Depth ControlNet generations are less of an illusion.

willsmith72
9 replies
19h42m

> This colab notebook requires a high RAM and V100 GPU runtime, available through Colab Pro.

That's sad, I would've loved to try it.

DonHopkins
6 replies
17h27m

Have you never put quarters into a PacMan machine?

willsmith72
5 replies
16h48m

i take cars for test drives before buying them

DonHopkins
3 replies
12h18m

Do you hang out at GameStop all day and test drive cars in GTA instead of renting a game about stealing them?

Aren't you sad they don't just let you shoplift it for free?

willsmith72
2 replies
5h44m

All day? No. For 5 min to see if I like the game? Sure.

DonHopkins
1 replies
4h51m

Do you also think it's sad you can't sneak into Disneyland for 5 minutes for free just to see if there are any streakers in It's a Small World?

I'm getting the impression you're just an entitled gamer who wants a free ride from the University of Michigan, not a professional programmer or AI developer who would actually get some tangible value out of subscribing to ChatGPT for $20 a month. I'm thankful to be alive in a time I can so conveniently get so much value for so little cash.

Is that the case? Is $10 really too much to ask to use a high-end GPU for a month? Then it's not really as sad and hopeless as you complain it is. Just be a good boy all year, ask Santa for a GeForce RTX 4090 for Christmas, leave some cookies and milk out for him, and hope your parents get the hint!

willsmith72
0 replies
2h42m

hah wrong on all counts. not a gamer, yes a professional "programmer", yes paying $20 for chatgpt.

people don't pay for things without getting a feel for what they're getting. hence the huge focus in saas on various monetisation strategies. if someone puts these anagrams in a product, it will be freemium or have a free tier, and then i will play with it.

there are 20 new projects like this every day, i'm not going to pay for all of them just to try them. i'll try the product if/when there is one

matsemann
0 replies
6h6m

So don't buy a V100 then, and test it for a few bucks online somewhere. If you want others to provide you that hardware for free, with no chance of you actually buying, you just come across as entitled.

nomel
0 replies
18h47m

I completely disagree. It's fantastic that we can get access to this hardware for so cheap. A used V100 is $1300. You could pay for Colab Pro for 10 years with that, which will get you faster and faster hardware through the years. Where I am, a month is the cost of two bags of chips.

andybak
0 replies
19h36m

well - chuck $10 at it and spend the rest of your month trying other things.

(Back in Disco Diffusion days I was happy to spend money on Colab Pro. It was fun)

jamilton
5 replies
21h22m

I really like the man/woman inversion.

I wonder how many permutations could legibly be generated in a single image with an extended version of the same technique. I don't understand the math, but would two orthogonal transformations in sequence still be an orthogonal transformation and thus work?

xanderlewis
2 replies
20h39m

I’m not sure whether ‘orthogonal transformations’ in this context refers to the usual orthogonal linear transformations (/matrices), but if so then yes.
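
For completeness, assuming the usual definition (real matrices Q with Q^\top Q = I), composing two orthogonal transformations gives another orthogonal transformation, so chaining views should work in principle:

    (Q_1 Q_2)^\top (Q_1 Q_2) = Q_2^\top (Q_1^\top Q_1) Q_2 = Q_2^\top Q_2 = I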

mkl
1 replies
11h0m

The article explicitly specifies orthogonal matrices.

xanderlewis
0 replies
5h24m

I saw that, but I’ve come across strange reuses of terminology before so I didn’t want to assume.

kurthr
0 replies
9h55m

The mosaics of a duck and a rabbit, however, were hilarious.

hombre_fatal
0 replies
17h22m

The man/woman one stuck out to me as well. I probably watched it ten times. Probably because it seems so forlorn.

mg
4 replies
21h50m

I had a similar idea early last year and also dabbled with a checkerboard approach.

Here a cat is made from 9 paintings of cats in the style of popular painters:

https://twitter.com/marekgibney/status/1521500594577584141

You might have to squint your eyes to see it.

I made a few of them and then somehow lost interest.

hammock
2 replies
19h35m

That's really cool. Can you do 3x3x3? As in, 9x9 with 81 1-cell cats, 9 9-cell cats and 1 81-cell cat?

mg
1 replies
11h11m

That could be interesting. A recursive cat, so to say.

The problem would be this: In the picture at hand, the big cat is rather simple. Just a portrait of a smiling cat. While the 9 smaller cats are doing all kinds of poses to adjust to the form of the big cat portrait. So the subcats are more complex than the main cat.

When doing the recursive cat, it would be hard to make a subcat from 9 subsubcats because the subcat is already a complex image that is not as easy to recognise as the main cat.

rereasonable
0 replies
4h52m

This thread reminded me of this old gem: https://thesecatsdonotexist.com/ (warning: you may see some catspiders / r/Imsorryjon material!)

Now what would be interesting is a "demixer" which allows you to locate the source image(s) from multiple iterations of a given image. Like a reverse image search but for generative images. I suppose it would rely on artefact matching or some other kind of granular pattern matching, along with other more general methods (assuming the source material is actually available online in the first place).

rob74
0 replies
9h24m

That looks more like a cat-aclysm to me TBH. Probably the model was overwhelmed by the conflicting requirements, so that neither the individual images nor the composite image are particularly good. But, as you wrote, maybe they will get better at this eventually...

hammock
4 replies
19h34m

Do real-life jigsaw puzzles like the ones shown here exist for purchase?

mkl
2 replies
10h54m

This research uses DeepFloyd IF, which forbids commercial use. They'd need to find/train another suitable image generator.

hammock
0 replies
1h14m

I’m curious how they even thought of the idea to train a jigsaw puzzle like that in the first place. My naive guess was that those types of puzzles were preexisting. If in fact it’s a novel type of puzzle, that idea in itself is as cool as the generator they created!

bertil
0 replies
6h13m

I’m curious if they could ask for permission from the original authors (who doesn’t love a fun puzzle?—and it’s not like the profit motive here is alarming): most licenses only set a default, and the authors can still grant permission.

You can always reach out and ask for a one-off in good faith.

shanedrgn
0 replies
13h58m

You could always make it yourself! Not sure how well the method above would scale up though https://www.createjigsawpuzzles.com/

mdonahoe
2 replies
17h24m

The man/woman color inversion one was the most impressive to me. On the rotations, I can rotate in my mind and see the other view… but I find it very hard to color invert mentally

usrusr
0 replies
3h12m

For me it's the reverse: the color inversions feel hardly more impressive than the morph animations that were all the rage in the 1990s, because while I certainly understand how straightforward color inversion is on the level of pixel data, I still can't "see" that simplicity. It hardly looks any different than an alpha blend with no relation at all.

The rotations on the other hand, wow! It is perfectly visible how the pixels don't change. You can physically rotate the screen and the image "changes". I could not think of a better illustration of how diffusion model images are not just echoes of preexisting images (they certainly are), but solutions to the problem of "find a set of pixels that will match the description of {prompt}". Or in this case, "that will match {A} when oriented this way and {B} when oriented that way".
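
To make that concrete, both views are trivial array operations; an illustrative sketch (not from the article), assuming an 8-bit RGB image loaded as a numpy array:

    import numpy as np

    img = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)  # placeholder image
    inverted = 255 - img           # color inversion: per-pixel arithmetic on the values
    rotated = np.rot90(img, k=2)   # 180-degree rotation: a pure permutation of pixels,
                                   # no pixel value changes at all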

gitgud
0 replies
15h5m

That is amazing. Here's the link for anyone interested (there are a lot of images on that page):

https://dangeng.github.io/visual_anagrams/static/videos/grid...

aunwick
2 replies
14h59m

So, in grad school I had access to an SGI Onyx and basically did this, but didn't toot my horn about it because: 1. I didn't think it was particularly amazing. 2. We didn't have social platforms yet.

Congratulations!

mkl
0 replies
10h49m

An SGI Onyx has a tiny fraction of the computing power needed to run text-to-image generative models like this.

DonHopkins
0 replies
4h32m

How do you pay for all that electricity?

Nition
2 replies
18h32m

The duck/rabbit that rearranges would be really cool to use on one of those sliding puzzles. Two valid solutions!

kurthr
0 replies
9h52m

With that many rearrangeable elements, you could make so many different "valid" solutions, indistinguishable without a photograph, that it would become art rather than a puzzle.

bertil
0 replies
6h15m

I’d need to check, but if one set of “ear and hole” can be swapped with another set, both sets have to be identical in shape and color. But if they split and attach to other edges rather than swap, that creates further connection.

If you think of the edges as nodes in a connected di-graph of ears and holes, possible pairs are connected: a swap is a two-pair cluster; further connection is a four-element chain with both ends open-ended. If that connection ties to more pairs, you might have a larger cluster of identical ears and holes. Given graph properties, that’s presumably most of them — see the prisoners paradox for why [0].

That would make the puzzle much more challenging to solve if most ears fit in most holes.

[0] The excellent Matt Parker https://www.youtube.com/watch?v=a1DUUnhk3uE but I recommend the following debate with Derek from Veritasium.

yeldarb
1 replies
3h53m

I'd love one of these on my wall. Imagining a framed version of the Einstein pop-art one where the circle in the middle rotates (either periodically or via a manual lever).

rafabulsing
0 replies
29m

The color inversion ones would work well with an E-Ink display.

rob74
1 replies
9h9m

As usual with AI-generated artwork: looks nice at first sight, but if you look closer, you can't help but notice the flaws. E.g. the ambigrams: in the "happy"/"holiday" one, the second word is actually missing the "i", and the two "blessing"s are really hard to read. Also, the "campfire man"'s face seems to be melting in a very disconcerting way...

belugacat
0 replies
9h2m

I'm a photographer, and for years I've been pixel peeping at photos taken on phones with "portrait mode"; many years after the first introduction of the feature, regardless of the implementation, results still look crummy to my eye.

Looking at fine elements like hairs (nevermind curly hair) is a disaster, especially when you're used to fine classic german/japanese optics that accurately reproduce every subtle detail of a subject while having extremely aesthetically pleasing sharpness falloff/bokeh.

I've had to swallow the pill though: No one (end users; pros are another story) cares about those details. People just want something that vaguely looks good in the immediate moment, and then it's on to the next thing.

I suspect it'll remain the same for AI generated visuals; a sharp eye will always be able to tell, but it won't really matter for consumption by the masses (where the money is).

dwighttk
1 replies
17h2m

Every single one of the examples is like "yeah... I mean, I guess... sorta"

the penguin/giraffe is probably the best one. The old lady/dress barely looks like either.

bertil
0 replies
5h53m

Those two are based on previously known ambigrams:

* very closely https://www.pinterest.com/pin/giraffepenguin--13398215764267...

* or directly inspired by, but the “young lady” prompt triggered the model to pick a dress, and there’s no way to make an eye and an ear or a mouth and a choker photo-realistically identical: https://www.reddit.com/r/RedditDayOf/comments/35cjn5/the_cla...

cloudyporpoise
1 replies
18h59m

This may be one of the cooler things i've ever seen

adkaplan
0 replies
18h2m

Some of these illusion styles I've seen drawn by hand before, but the lithophane ones are new to me. I'm sure the 3D printing lithophane community will love them.

IIAOPSW
1 replies
17h5m

I feel like a neural network is probably overkill for this task and a suboptimal substitute for a theoretical understanding of optical illusions, but can't argue with results.

bertil
0 replies
5h56m

Most of them are not “illusions” where you perceive two identical segments as different lengths because of tricks of human perception; they are ambigrams. They rely on humans’ ability to think of any three dots as two eyes and a mouth.

They also “copy”, in the way those networks so often seem to (often enough to somehow get copyright strikes); they were either prompted with existing solutions or learned them wholesale through training:

* The penguin and giraffe one is a previously known ambigram, for example.

* The old lady turning into a dress is obviously based on a classic pencil drawing where a similar old lady hiding in her collar turns into a young lady looking over her shoulder [0]; however, the network interpreted “young lady” as a white dress, presumably because color-matching the two different body parts from the pencil outline and making it photorealistic would have been much harder otherwise. There are photorealistic interpretations, though [1].

I’m more impressed by the radically new ones, like the fire flipping into a face—but most of those rely on having two distinct parts of the image be meaningful in their own context, and not relevant otherwise.

The black-and-white inversion man/woman is impressive because the two interpretations are not on separate parts of the image. That’s where you can interpret the quality of the effect as the model having learned how humans perceive and pay attention to dark and light contrasts differently. That one captures an understanding of perception.

[0] https://www.reddit.com/r/RedditDayOf/comments/35cjn5/the_cla...

[1] https://www.jagranjosh.com/general-knowledge/optical-illusio...

moritzwarhier
0 replies
20h8m

This is wonderful.

kevinwang
0 replies
14h41m

Wow, these examples are amazing

guybedo
0 replies
14h36m

the explosion in creativity brought by generative AI truly is incredible.

cwkoss
0 replies
17h48m

Would be cool to make some of these that look like different things under red/blue light.

chrisweekly
0 replies
17h53m

I really enjoy these. Great post.

DonHopkins
0 replies
17h30m

From the HN "Boustrophedon" discussion:

https://news.ycombinator.com/item?id=15539373

https://en.wikipedia.org/wiki/Boustrophedon

https://news.ycombinator.com/item?id=15547162

DonHopkins on Oct 25, 2017:

Scott Kim has a wonderful talent at designing "ambigrams". Check out his classic book "Inversions" and his gallery of more recent work!

http://www.scottkim.com.previewc40.carrierzone.com/inversion...

An inversion is a word or name written so it reads in more than one way. For instance, the word Inversions above is my name upside down. Douglas Hofstadter coined ambigram as the generic word for inversions. I drew my first inversion in 1975 in an art class, wrote a book called Inversions in 1981, and am now doing animated inversions.

A Scott Kim Ambigram for "George Hart":

https://www.georgehart.com/scott-kim.html

John Maeda's Blog: Scott Kim’s Ambigrams

https://maeda.pm/2017/12/17/scott-kims-ambigrams/

The Inversions of Scott Kim:

https://www.anopticalillusion.com/2012/04/the-inversions-of-...

Channel: An Optical Illusion » scott kim:

https://optical397.rssing.com/chan-26600952/index-latest.php

Scott Kim’s symmetrical alphabet:

https://stancarey.wordpress.com/2012/10/18/scott-kims-symmet...

Typography Two Ways: Calligraphy With a Twist

https://www.wired.com/2009/05/pl-arts-6/