Note that this technique and its results are unrelated to the infamous "spiral" ControlNet images a couple months back: https://arstechnica.com/information-technology/2023/09/dream...
Per the code, the technique is based on DeepFloyd-IF, which is not as easy to run as a Stable Diffusion variant.
I missed it, what was infamous about it?
It created a backlash because (a) it was too popular, with AI people hyping "THIS CHANGES EVERYTHING!" and posting low-effort transformations to the point of saturation, and (b) non-AI people felt "tricked" into thinking it was a clever trick done with real art, since ControlNet is not well known outside the AI-sphere, and they got mad.
I rather liked it and actually didn't get to see as many examples as I wanted to.
Is there a good repository anywhere or is it just "wade through twitter"?
Not a repository as such, but I linked to some good examples in my Sept recap:
https://www.latent.space/p/sep-2023
https://github.com/swyxio/ai-notes/blob/main/Monthly%20Notes...
It is real art.
Did you mean to say it's _related_? The original "spiral" image by Ugleh is explicitly credited in the "Related Links" section.
It’s a similar topic, which is why they credit it, but the mechanism is quite different.
I haven't dug in yet, but it _should_ be possible to use their ideas in other diffusion networks? It may be a non-trivial change to the code provided though. Happy to be corrected of course.
I suspect the trick only works because DeepFloyd-IF operates in pixel space, while most other diffusion models operate in latent space.
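The pixel-space point matters because the technique relies on applying an image-space transform (flip, rotation, etc.) to the noisy sample, denoising the transformed view, and mapping the noise estimate back before averaging. In pixel space those transforms commute cleanly with the denoiser's input; in a latent space they generally don't. A toy NumPy sketch of that averaging step, with a linear stand-in for the real denoiser (the function name `predict_noise` and the 0.1 scaling are illustrative assumptions, not the paper's code):

```python
import numpy as np

def predict_noise(x):
    # Stand-in for a pixel-space denoiser such as DeepFloyd-IF's UNet.
    # A real model would be nonlinear and conditioned on a text prompt
    # (a different prompt per view is what creates the illusion).
    return 0.1 * x

def anagram_step(x, transform, inverse):
    # Estimate noise for the image and for a transformed view,
    # map the second estimate back into the original orientation,
    # and average the two estimates.
    eps_a = predict_noise(x)
    eps_b = inverse(predict_noise(transform(x)))
    return 0.5 * (eps_a + eps_b)

x = np.random.rand(64, 64)
flip = lambda im: im[::-1]  # vertical flip; it is its own inverse
eps = anagram_step(x, flip, flip)
```

Because the flip acts directly on pixels, `inverse(predict_noise(transform(x)))` is a valid noise estimate for `x` itself. With a latent-space model you'd have to flip the latent tensor, and a flipped latent does not decode to a flipped image, which is why porting the idea to Stable Diffusion is non-trivial.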
I always thought it was weird that this idea took off with that particular controlnet model. Many other controlnet models when combined with those same images produce excellent and striking results.
The ecosystem around Stable Diffusion in general is so massive.
Other ControlNet adapters either don't preserve the high-level shape enough or preserve it too well, IMO. Canny/Depth ControlNet generations are less of an illusion.