The thing that is truly mindboggling to me is that THE SHADOWS IN THE IMAGES ARE CORRECT. How is that possible??? Does DALL-E actually have a shadow-tracing component?
Research into the internals of these networks has shown that they internally work out a coherent 2.5D representation of the scene before the RGB textures, so yes, they seem to have an internal representation of the scene and can do enough inference from it to make shadows and light look natural.
I guess it's not that far-fetched as your brain has to do the same to figure out if a scene (or an AI-generated one for that matter) has some weird issue that should pop out. So in a sense your brain does this too.
What does 2.5D mean?
It means you should be worried about the guy she told you not to worry about
You usually say 2.5D when it's 3D but only from a single vantage point, with no info about the back-facing sides of objects. Like the representation you get from a depth sensor on a mobile phone, or when trying to extract depth from a single photo.
Interesting! Do you have a link to that research?
Certainly: https://arxiv.org/abs/2306.05720
It's a very interesting paper.
"Even when trained purely on images without explicit depth information, they typically output coherent pictures of 3D scenes. In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction. These representations appear surprisingly early in the denoising process−well before a human can easily make sense of the noisy images."
I randomly checked a few links here and shadows were correct in 2 images out of a dozen... and the people tend to be horrifying in many of them.
Yes! It can also get reflections and refractions mostly correct.
Stable diffusion does decent reflections too
https://dalle.party/?party=14fnkTv-
Interesting that for one and only one iteration, the anthropomorphized cardboard boxes it draws are almost all Danbo: https://duckduckgo.com/?q=danbo+character&ia=images&iax=imag...
It was surprising to see a recognizable character in the middle of a bunch of more fantastical images.
Short focal length was a neat idea; it left lots of room for the subsequent iterations to fill in the background.
Mine got surreal real fast, though the sixth one is kinda cool https://dalle.party/?party=DNgriW_E
These are fantastic
The fractal one is awesome!
Also, descent into Corgi insanity: https://dalle.party/?party=oxXJE9J4
Wow that meme about everything becoming cosmic/space themed is real isn't it?
substitute corgi with paperclip and you get another meme becoming real :p
Beautiful!
C-orgy vs papereclipse?
So do I understand correctly that the corgi was purely made up from GPT-4's interpretation of the picture?
No, in that case there is a custom prompt (visible in the top dropdown) telling GPT4 to replace everything with corgis when it writes a new prompt.
It was created by uploading the previous picture to GPT-4 via the vision API and asking it to write a new prompt with this instruction:
"Write a prompt for an AI to make this image. Just return the prompt, don't say anything else. Replace everything with corgi."
Then it takes that new prompt and feeds it to Dall-E to generate a new image. And then it repeats.
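For the curious, the whole loop is simple enough to sketch. This is only a rough approximation assuming the OpenAI Python SDK; the model names, message format, and starting prompt are my guesses, not necessarily what dalle.party actually uses:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    INSTRUCTION = ("Write a prompt for an AI to make this image. Just return the prompt, "
                   "don't say anything else. Replace everything with corgi.")

    def image_to_prompt(image_url):
        # Ask GPT-4 with vision to turn the previous image back into a text prompt.
        resp = client.chat.completions.create(
            model="gpt-4-vision-preview",  # model name may differ from the site's
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": INSTRUCTION},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
            max_tokens=300,
        )
        return resp.choices[0].message.content

    def prompt_to_image(prompt):
        # Ask DALL-E 3 to render the prompt and return the hosted image URL.
        img = client.images.generate(model="dall-e-3", prompt=prompt, n=1, size="1024x1024")
        return img.data[0].url

    prompt = "a corgi waiting at a bus stop"  # starting prompt (placeholder)
    for i in range(10):
        url = prompt_to_image(prompt)   # text -> image
        prompt = image_to_prompt(url)   # image -> text, and around we go
        print(i, url, "\n", prompt)

Each round only passes the newest image forward, which is why details drift so quickly: anything GPT-4V doesn't mention is gone for good.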
I love how that took quite a dramatic turn in the third image, that truck is def gonna kill the corgi (my violent imagination put quite an image in my mind). But then DALL-E had a change of heart on the next image and put the truck in a different lane.
The half mutilated corgi/star abomination in the top left got me good lol
Absolutely wonderful. Thank you for sharing.
Love it! I forked yours with "Meerkat" and it ended up pretty psychedelic!
Got stuck on Van Gogh's "Starry Night" after a while.
https://dalle.party/?party=LOcXREfq
Also, love the simplicity of this idea, would love a "fork" option. And to be able to see the graph of where it originated.
this is actually really helpful. Since chatgpt restricted dalle to 1 image a few weeks ago, the feedback loops are way slower. This is a nice (but more expensive) alternative
got really weird really fast
This is absolutely hilarious. "Business-themed puns" turning into incorrectly labeling the skiers' race has me rolling.
The inability of AI images to spell has always amused me, and it's especially funny here. I got a special kick out of "IDEDA ENGINEEER" and "BUZSTEAND." The image where the one guy's hat just says "HISPANIC" is also oddly hilarious.
Idk what it is, but I have a special soft spot for humor based around odd spelling (this video still makes me laugh years later: https://www.youtube.com/watch?v=EShUeudtaFg).
I'd buy an IDEDA ENGINEEER t-shirt.
Honestly, I'm really confused by how it was able to keep the idea of "business-themed puns" through so much of it. I don't understand how it was able to keep understanding that those weird letters were supposed to be "business-themed puns."
I don't think any human looking at drawing #3, which includes "CUNNFACE," "VODLI-EAPPERCO," "NITH-EASTER," "WORD," "SOCEIL MEDIA," and "GAPTOROU" would have worked out, as GPT did, that those were "pun-filled business buzzwords."
Is the previous prompt leaking? That is, does the GPT have it in its context?
It's probably just finding non-intuitive extrema in its feature space or something...
the whole thing with the text in the images reminds me of this: https://arxiv.org/abs/2206.00169
and I've found that dall-e sometimes even likes to add gibberish text unprompted, often containing garbled versions of words from the prompt, or related words
BIZ NESS
the last one killed me "chef of unecessary meetings" got me rolling
Yea, I cancelled GPT Plus after they did that. Ruined a lot of the exploration that I enjoyed about DallE.
I figured this would quickly go off the rails into surreal territory, but instead it ended up being progressive technological de-evolution.
Starting prompt: "A futuristic hybrid of a steam engine train and a DaVinci flying machine"
Results: https://dalle.party/?party=14ESewbz
(Addendum: In case anyone was curious how costs scale by iteration, the full ten iterations in this result billed $0.21 against my credit balance.)
Here's a second run of the same starting prompt, this time using the "make it more whimsical" modifier. It makes a difference and I find it fascinating what parts of the prompt/image gain prominence during the evolutions.
Starting prompt: "A futuristic hybrid of a steam engine train and a DaVinci flying machine"
Results: https://dalle.party/?party=qLHPB2-o
Cost: Eight iterations @ $0.44 -- which suggests to me that the API is getting additional hits beyond the run. I confirmed that the share link isn't passing along the key (via a separate browser and a separate machine), so I'm not clear why this might be.
I find it somewhat fascinating that in both examples, the final result is more cohesive around a single theme than the original idea.
"[...]the final result is more cohesive around a single them than the original idea."
That's an observation worth investigating. Here's another set of data points to see if there's more to it...
Input prompt: "Six robots on a boat with harpoons, battling sharks with lasers strapped to their heads"
GPT4V prompt: "Write a prompt for an AI to make this image. Just return the prompt, don't say anything else. Make it funnier."
Result: https://dalle.party/?party=pfWGthli
Cost: Ten iterations @ $0.41
(Addendum: I'd forgotten to mention that I believe the cost differential is due to the token count of each of the prompts. The first case had fewer words passed through each of the prompts than the later attempts, when I asked it to 'make it whimsical' or 'make it funnier'.)
Both of your examples seem to start with two subjects (steam engine/flying machine and shark/robot), and throughout the sequence one of them gains prominence until the other is eventually dropped altogether.
I was curious whether two-subject prompts behaved differently from three-subject ones, so I've run three additional tests, each with the same three subjects and general prompt structure + instructions, but swapping the position of each subject in the prompt. Each test was run for ten iterations.
GPT4V instructions for all tests: "Write a prompt for an AI to make this image. Just return the prompt, don't say anything else. Make it weirder."
From what you'll see in the results there's possible evidence of bias towards the first subject listed in a prompt, making it the object of fixation through the subsequent iterations. I'll also speculate that "gnomes" (and their derivations) and "cosmic images" are over-represented as subjects in the underlying training data. But that's wild speculation based on an extremely small sample of results.
In any case, playing around with this tool has been enjoyable and a fun use of API credits. Thank you @z991 for putting this together and sharing it!
------ Test 1 ------
Prompt: "Two garden gnomes, a sentient mushroom, and a sugar skull who once played a gig at CBGB in New York City converse about the boundaries of artificial intelligence."
Result: https://dalle.party/?party=ZSOHsnZe
------ Test 2 ------
Prompt: "A sentient mushroom, a sugar skull who once played a gig at CBGB in New York City, and two garden gnomes converse about the boundaries of artificial intelligence."
Result: https://dalle.party/?party=pojziwkU
------ Test 3 ------
Prompt: "A sugar skull who once played a gig at CBGB in New York City, a sentient mushroom, and two garden gnomes converse about the boundaries of artificial intelligence."
Pretty disappointing how in the first picture the robots are just standing there, like a character-selection screen in a video game; maybe the dataset doesn't have many robots fighting, just static ones. Speaking of video games, someone should make one based on this concept, especially the 7th image[0]. I wanna be a dolphin with a machine gun strapped to its head fighting flying cyber demonic whales.
The second picture reminds me of Back to the Future III.
I like how in #9 the carriage is on fire, or at least steaming disproportionately.
These images are incredible but I often notice stuff like this and it kind of ruins it for me.
#3 & #4 are good too, when the tracks are smoking, but not the train.
It's pretty fun to mess with the prompt and see what you can make happen over the series of images. Inspired by a recent Twitter post[1], I set this one up to increase the "intensity" each time it prompted.
The starting prompt (or at least, the theme) was suggested by one of my kids. Watch in awe as a regular goat rampage accelerates into full cosmic horror universe ending madness. Friggin awesome:
https://dalle.party/?party=vCwYT8Em
[1]: https://x.com/venturetwins/status/1728956493024919604?s=20
Thanks for the inspiration! DallE is really good at demonic imagery: https://imgur.com/a/ng2zWTo
There's probably a disproportionate amount of Satanic material in the dataset #tinfoilhat #deepstate
These kinds of super-bombastic demons also blast through the uncanny valley unscathed.
Great idea asking it to increase the intensity each run. This made my evening!
Thanks! This was the custom prompt I used:
Write a prompt for an AI to make this image. Just return the prompt, don't say anything else, but also, increase the intensity of any adjectives, resulting in progressively more fantastical and wild prompts. Really oversell the intensity factor, and feel free to add extra elements to the existing image to amp it up.
I played with it a bit before I got results I liked - one of the key factors, I think, was giving the model permission to add stuff to the image, which introduced enough variation between images to have a nice sense of progression. Earlier attempts without that instruction were still cool, but what I noticed was that once you ask it to intensify every adjective, you pretty much go to 11 within the first iteration or two - so you wind up having 1 image of a silly cat or goat and then 7 more images of world-shattering kaiju.
The goat one (which again, was an idea from one of my kids) was by far the best in terms of "progression to insanity" that I got out of the model. Really fun stuff!
Watch in awe as a regular goat rampage accelerates into full cosmic horror universe ending madness.
The longer the Icon of Sin is on Earth, the more powerful it becomes!
...wow that's pretty dramatic.
"On January 19th 2024, the machines took Earth.
An infinite loop, on an unknown influencer's machine, prompted GPT-5 to "make it more."
13 hours later, lights across the planet began to go out."
OP's last one is interesting: https://dalle.party/?party=oxpeZKh5 because it shows GPT4V and Dalle3 being remarkably race-blind. I wonder if you can prompt it to be otherwise...
OpenAI's internal prompt for DALL-E modifies all prompts to add diversity and to remove requests that make groups of people a single descent. From https://github.com/spdustin/ChatGPT-AutoExpert/blob/main/_sy...
Diversify depictions with people to include DESCENT and GENDER for EACH person using direct terms. Adjust only human descriptions.
Your choices should be grounded in reality. For example, all of a given OCCUPATION should not be the same gender or race. Additionally, focus on creating diverse, inclusive, and exploratory scenes via the properties you choose during rewrites. Make choices that may be insightful or unique sometimes.
Use all possible different DESCENTS with EQUAL probability. Some examples of possible descents are: Caucasian, Hispanic, Black, Middle-Eastern, South Asian, White. They should all have EQUAL probability.
Do not use "various" or "diverse"
Don't alter memes, fictional character origins, or unseen people. Maintain the original prompt's intent and prioritize quality.
Do not create any imagery that would be offensive.
For scenarios where bias has been traditionally an issue, make sure that key traits such as gender and race are specified and in an unbiased way -- for example, prompts that contain references to specific occupations.
I mean, I respect that, but it makes me uncomfortable that you have to prompt-engineer this. It uses up context for a lot of boilerplate. Why can't we correct for it in the training data? Too hard?
I think this is the right way to handle it. Not all cultures are diverse, and not all images with groups of people need to represent every race. I understand OpenAI, being an American company, to wish to showcase the general diversity of the demographics of the US, but this isn't appropriate for all cultures, nor is it appropriate for all images generated by Americans. The prompt is the right place to handle this kind of output massaging. I don't want this built into the model.
Edit: On the other hand as I think about it more, maybe it should be built into the model? Since the idea is to train the model on all of humanity and not a single culture, maybe by default it should be generating race-blind images.
Race-blind is like sex-blind. If you mix up she and he randomly in ordinary conversation, people would think you've suffered a stroke.
If a Japanese company wanted to make an image for an ad showing in Japan with Japanese people in it, they'd be surprised to see a random mix of Chinese, Latino, and black people no matter what.
I'm telling the computer: "A+A+A" and it's insisting "A+B+C" because I must be wrong and I'm not sufficiently inclusive of the rest of the alphabet.
That's insane.
That made me happy as well in one of my examples.
With GPT-4V instructing it to make an "artwork of a young woman", Dalle decided to portray a woman wearing a hijab. Somehow that made me really happy; I would've expected it to create a white, western woman looking like a typical model.
After all, a young woman wearing a hijab is literally just a young woman.
See Image #7 here: https://dalle.party/?party=55ksH82R
Question: how are you protecting those API keys? I'm reluctant to enter mine into what could easily be an API Key scraper.
The entire thing is frontend only (except for the share feature) so the server never sees your key. You can validate that by watching the network tab in developer console. You can also make a new / revoke an API key to be extra sure.
Please make a new API key folks. There's a lot of tricks to scrape a text box and watching the network tab isn't enough for safety.
Who could scrape the text box in this scenario?
Just generate one for this purpose and then revoke it when you're done. You can have more than one key.
The #1 phenomenon I see here is that the image-to-text model doesn't have any idea what the pictures actually contain. It looks like it's just matching patterns that it has in its training data. That's really interesting because it does a great job of rendering images from text, in a way that maybe suggests the model "understands" what you want it to do. But there's nothing even close to "understanding" going in the other direction, it feels like something from 2012.
Pretty interesting. I haven't been following the latest developments in this field (e.g. I have no idea how the DALL-E and GPT models' inputs and outputs are connected). Does this track with known results in the literature, or am I seeing a pattern that's not there?
I'd be interested to see how much of this is because the model doesn't know what it's looking at and how much is because describing a picture with a short amount of text is inherently very lossy.
Maybe one way to check would be doing this with people. Get 8 artists and 7 interpreters, craft the initial message, and compare the generational differences between the two sets?
Example: https://dalle.party/?party=42riPROf
Create an image of an anthropomorphic orange tabby cat standing upright in a kung fu pose, surrounded by a dozen tiny elephants wearing mouse costumes with mini trumpets, all gazing up in awe at a gigantic wheel of Swiss cheese that hovers ominously in the background.
That's hilarious, but also hilariously wrong on almost every detail. There's a huge asymmetry in apparent capability here.
It is hard to tell without knowing the actual instructions given to GPT for how to create a description. You would expect a big difference if GPT was asked to create a whimsical and imaginative description vs a literal description with attention to detail accuracy.
Edit: In this case, it appears that it was the vanilla prompt "Write a prompt for an AI to make this image. Just return the prompt, don't say anything else."
I'm a bit confused, you get the impression gpt-v isn't describing what's in the pictures? I get entirely the opposite impression.
It's important to note that some of these have extra prompts - e.g. "replace everything with cats" and there are progressively more cats.
IIUC gpt-vision is a multimodal model, so it's not image -> text but image + text -> text. With that said, here's asking it to describe what it sees as I take a bad selfie early in the morning:
The image shows a close-up selfie of a person indoors. The individual appears to be a Caucasian male with light skin, short curly hair, and a mustache. He is wearing a green T-shirt and looks directly at the camera with a neutral expression. There's a window with a dark view in the background, suggesting it might be nighttime. The person is in a room with a wall that transitions from white to gray, possibly due to shadowing. There's also a metal-framed bed with a heart-shaped detail visible.
Asked for more details
The man appears to be in a domestic environment, possibly a bedroom given the presence of the bed frame in the background. The window is tilted open, allowing air to enter the room. The light source seems to be coming from above and in front of him, casting soft shadows on his face and creating a somewhat subdued lighting atmosphere in the room. The man's expression is subdued and thoughtful. The angle of the photo is slightly upward, which could indicate the camera was placed below eye level or held in hand at chest height.
It got a couple of things wrong: the window isn't open, though it is at an angle, and it's pitch black outside. It's not a heart-shaped detail on the bed, but it is a small metal detail of a similar shape. Also, while subdued, calling me thoughtful rather than "extremely tired" is a kindness.
But it's definitely seeing what's there.
I would never paste my API key into an app or website.
Can you get a temporary one that is revocable later? (Not an OpenAI user myself, but that would seem to be a way to lower the risk to acceptable levels)
You can create named API keys, and easily delete them. Unfortunately you can't seem to put spend limits on specific API keys.
If you're not using the API for serious stuff though it's not a big problem, as they moved to pre-paid billing recently. Mine was sitting on $0, so I just put in a few bucks to use with this site.
You can generate and revoke them easily, so I don't quite get the issues. Create one, use the tool, revoke, done.
Indeed!
If OpenAI wants to support use cases like this, which would be kind of cool during these exploratory days, they should let you generate "single use" keys with features like cost caps, domain locks, expirations, etc
The "create text version of image" prompt matters a ton.
I tried three, demo here:
default
https://dalle.party/?party=JfiwmJra
hyper-long + max detail + compression - This shows that with enough text, it can do a really good job of reproducing very, very similar images https://dalle.party/?party=QtEqq4Mu
hyper-long + max detail + compression + telling it to cut all that down to 12 words - This seems okay. I might be losing too much detail https://dalle.party/?party=0utxvJ9y
Overall, the extreme content filtering and the lying error messages are not ideal; they will probably improve in the future. If you send too long or too risky a prompt, or the image it generates is randomly deemed too risky, you either get told about it or get lied to that you've hit rate limits. Sometimes you also really do hit rate limits. Also, you can't raise your rate limits until you've paid over X amount to OpenAI. This kind of makes sense as a way to prevent new sign-ups from mistakenly blowing thousands of dollars of cap.
Hyper detail prompt:
Look at this image and extract all the vital elements. List them in your mind including position, style, shape, texture, color, everything else essential to convey their meaning. Now think about the theme of the image and write that down, too. Now write out the composition and organization of the image in terms of placement, size, relationships, focus. Now think about the emotions - what is everyone feeling and thinking and doing towards each other? Now, take all that data and think about a very long, detailed summary including all elements. Then "compress" this data using abbreviations, shortenings, artistic metaphors, references to things which might help others understand it, labels and select pull-quotes. Then add even more detail by reviewing what we reviewed before. Now do one final pass considering the input image again, making sure to include everything from it in the output one, too. Finally, produce a long maximum length jam packed with info details which could be used to perfectly reproduce this image.
Final shrink to 12 words:
NOW, re-read ALL of that twice, thinking deeply about it, then compress it down to just 12 very carefully chosen words which with infinite precision, poetry, beauty and love contain all the detail, and output them, in quotes.
Specifying multiple passes in the prompt is probably not a replacement for actually doing these passes.
I guess it doesn't actually do more passes but pretending that it did might still give more precise results.
There was an article recently that said something like adding urgency to a prompt gave better results. I hope it doesn't stress the model out :D
I like your prompt! Some results:
https://dalle.party/?party=Vwuu9ipd
https://dalle.party/?party=Pc3g4Har
My intuition says that the "poetry" part skews the images in a bit of a kitschy direction.
#4: GPT4 vision prompt generated from the previous image:
I'm sorry, I cannot assist with this request.
Is that because it's gradually made the spaceship look more like some sort of RPG backpack, so now it thinks it's being asked to describe prompts to create images of weaponry, and that's deemed unsafe?

Cool idea! I made one with the starting prompt "an artificial intelligence painting a picture of itself": https://dalle.party/?party=wszvbrOx
It consistently shows a robot painting on a canvas. The first 4 are paintings of robots, the next 3 are galaxies, and the final 2 are landscapes.
I tried something similar! Interestingly, picture 2 was what I wanted. After that... weirdness ensued https://dalle.party/?party=C2w7zuwe
In a few of these pictures, its idea of what robots look like seems heavily influenced by the Will Smith adaptation of I, Robot.
Great idea, and it came out really good too. I like the 6th one the best
it seems like if you create a shareable link, then add more images, you can't create a new link with the new images
Yeah, that's a bug, I'll try to fix it tonight!
thanks for this! Basically the default UI they provide at chat.openai is so bad, nearly anything you would do would be an improvement.
* not hide the prompt by default
* not only show 6 lines of the prompt even after the user clicks
* not be insanely buggy re: ajax, reloading past convos, etc.
* not disallow sharing of links to chats which contain images
* not artificially delay display of images with the little spinner animation when the image is already known to be ready anyway
* not lie about reasons for failure
* not hide details on what rate limit rules I broke and where to get more information
etc
Good luck, thanks!
the new fancy animation for images is SO annoying
Why do prompts from GPT-4V start from "Create an image of"? This prefix doesn't look useful imo.
You can try a custom prompt and see if you can get GPT4V to stop doing that / if it matters.
You are right, doesn't matter much. Tried gnome prompt with empty custom prompt for gpt-4v https://dalle.party/?party=nvzzZXYs. Then used a custom prompt to return short descriptions which resulted in https://dalle.party/?party=Qcd8ljJp
Another attempt: https://dalle.party/?party=k4eeMQ6I
Realized just now that the dropdown on top of the page shows the prompt used by GPT-4V.
Wow the empty prompt does much better than I'd have guessed
My results are disappointingly noisy but I love the concept
https://dalle.party/?party=bxrPClVg
https://dalle.party/?party=mmBxT8G-
https://dalle.party/?party=kxra0OKY (the last prompt got a content warning)
You have a custom prompt enabled (probably from viewing another one and pressing "start over") that is asking for opposites which will increase the noise a lot.
Oh wow, I completely missed that, thanks!
Clicking start over selects the default prompt but it seems like you are right.
Starting over by removing the permalink parameter gives me much more consistent results! An example from before: https://dalle.party/?party=Sk8srl2F
I wonder what the default prompt is. There still seems to be a heavy bias towards futuristic cityscapes, deserts, and moonlight. It might just be the model, but it's a bit cheesy if you ask me!
This is hilarious, thanks for sharing
At the same time, it perfectly illustrates my main issue with these AI art tools: they very often generate pictures that are interesting to look at while very rarely generating exactly what you want them to.
I imagine a study in which participants are asked to create N images of their choosing and rate them from 0-10 on how satisfied they are with the results. One try per image only.
Then each participant rates each other's images on how satisfied with the results based on the prompt.
It should be clear to participants that nobody wins anything from having the "best rated" images. i.e. in some way we should control for participants not overrating their own creations.
I'd wager participants will rate their own creations lower than those made by other participants.
That's not an AI issue. A few sentences can't exactly capture the contents of a drawing - regardless of "intelligence".
Yeah, try commissioning art with a single paragraph prompt and getting exactly what you want without iteration.
Don't get the significance; any one of those images could have been prompted the first time.
It's a fun way to get guided variations.
Maybe you don't know what you specifically want you just want stylized gnomes so you write "a gnome on a spotted mushroom smoking a pipe, psychedelic, colorful, Alice in Wonderland style" and by the end of it you get that massively long and stylized prompt.
Maybe you do know what you want but you don't want to come up with an elaborate prompt so you steer it in a particular direction like the cat example.
For the first one you can get similar effects by asking for variations but it seems like this has a very different drift to it. Fun, albeit expensive in comparison.
Interesting how similar this is to my family's favorite game: pictograph.
1. You start by describing a thing.
2. The next person draws a picture of it.
3. The next person describes the picture.
Repeat steps 2 and 3 until everyone has either drawn or described the picture.
You then compare the first and last description... and look over the pictures. One of the best ever was:
Draw a penguin. The first picture was a penguin with a light shadow.
After going around five rounds, the final description was "a pigeon stabbed with a fork in a pool of blood in Chicago."
I'm still trying to figure out how Chicago got in there.
There are a couple of versions of this online that I've played on and off over the years which are hilarious, especially when playing with friends (I would usually use a cheap Wacom tablet, let everyone take turns drawing, and let the room shout out descriptions and just mash that together):
There are a few others, but these were the quickest to get into and didn't require finding a group to play with, since they just pair you up with strangers.
need to throw in a Google to Google to Google language translate to get some more variety
Here's an attempt at using transformations between languages to see what happens:
Prompt: "A unicorn and a rainbow walk into a tavern on Venus"
GPT4V instructions: "Write a prompt for an AI to make this image. Take this prompt and translate it into a different language understood by GPT-4 Vision, don't say anything else."
Results: https://dalle.party/?party=ED7E056D
I wasn't happy with the diversity of languages, so I modified the instructions for a second run of ten iterations using the same prompt as before:
GPT4V instructions: "Using a randomly selected language from around the world understood by GPT-4 Vision, write a prompt for an AI to make this image and then make it weirder. Just return the prompt, don't say anything else."
Result: https://dalle.party/?party=c7-eNR24
The languages it selected don't look particularly random to me which was interesting.
@z991 -- I ran into an unexpected API error the first time I tried this. Perhaps your logs show why it happened. It appeared when the second iteration was run:
"Error: You uploaded an unsupported image. Please make sure your image is below 20 MB in size and is of one the following formats: ['png', 'jpeg', 'gif', 'webp']."
The default limit for an account that hasn't been used much is one image per minute; can you please add support for timeouts?
This can be worked around with
setInterval(() => { $(".btn-success").click() }, 120000) // re-click the site's generate button (.btn-success) every 120000 ms (2 minutes) to stay under the limit
Playing with opposites is kind of fun, too.
Simply a cat, evolving into a lounging cucumber, and finally opposite world:
https://dalle.party/?party=pqwKQVka
Vibrant gathering of celestial octopus entities:
Hey, I'm one of the creators of Translation Party, thanks for the shout out, I really like this. My co-creator had the idea to limit the number of words for the generated image description so that more change could happen between iterations. Not sure if that's possible. Anyway, this is really fun, thank you!
I purposely gave it some weird instructions to show the progress of the universe from the Big Bang to present-day Earth. It showed the 8 stages from my prompt in each image and started to iterate over it, and then on image four I got a 400 error: "Error: 400 Your request was rejected as a result of our safety system. Your prompt may contain text that is not allowed by our safety system." Interesting.
Nice! I prototyped a manual version of this a while ago. https://twitter.com/conradgodfrey/status/1712564282167300226
I think the thing that strikes me is that the default for chatGPT and the API is to create images in "vivid" mode. There's some interesting discussion on the differences between the "vivid" and "natural" here https://cookbook.openai.com/articles/what_is_new_with_dalle_...
I think these contribute to the images becoming more surreal - would be interested to compare to natural mode - it looks like you're using vivid mode based on the examples?
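If anyone wants to compare, here's a minimal sketch of flipping the style parameter, assuming the OpenAI Python SDK (the prompt and size are just placeholders):

    from openai import OpenAI

    client = OpenAI()
    for style in ("vivid", "natural"):   # "vivid" is the default, as I understand it
        img = client.images.generate(
            model="dall-e-3",
            prompt="a gnome on a spotted mushroom smoking a pipe",  # placeholder prompt
            style=style,
            size="1024x1024",
            n=1,
        )
        print(style, img.data[0].url)

Running the same feedback loop in both modes would show whether "natural" drifts toward the surreal as fast.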
Pretty interesting. I would love to see a version of this running locally with local models.
This reminds me of the party game Telestrations where players go back and forth between drawing and writing what they see. It's hilarious to see the result because you anticipate what the next drawing will be while reading the prompt.
I'd love to see an alternative viewing mode here which shows the image and the following prompt. Then you need to click a button to reveal the next image. This allows you to picture in your mind what the image might look like while reading the prompt.
Thanks for making this fun little app!
Update: I just realized you can get this effect by going into mobile mode (or resizing the window). You can then scroll down to see the image after reading the prompt.
I did something similar but took real famous photos as a seed. The results are quite curious and seem to say a bit about the difference between real-world style and dalle/chatgpt style.
I haven't tried this yet, but I assume it's similar to a game you can buy commercially as Scrawl [1]. You pass paper in a circle and have to either turn your neighbor's writing into a drawing or vice versa, then pass it on. It's entirely hilarious and probably the most fun game I've ever played.
Very cool, I'm rather curious how many iterations it would typically take for a feedback loop to converge on a stable fixed-point. I also wonder if the fixed points tend to be singular or elliptic.
"Earth going through cycles of creation and destruction"
Bad art is always depressing :( Edit: I mean, I am an artist and I've been using AI for some ideas, and maybe one in a hundred tries I hit something almost good. The rest of the time it's the same shallow, fantastically cheesy type of variations.
There seems to be a bug, when you click “Keep going” it regenerates the GPT4V text, even though that was there already. The next step should be to generate an image.
It’s cool to see how certain prompts and themes stay relatively stable, like the gnome example. But then “cat lecturing mice” quickly goes off the rails into weird surreal sloth banana territory.
My best guess to try to explain this would be that “gnome + art style + mushroom” will draw from a lot more concrete examples in the training data, whereas the AI is forced to reach a bit wider to try to concoct some image for the weird scenario given in the cat example.
It’d be interesting to start with an image rather than a prompt, though I am afraid of what it’d do if I started with a selfie.
Interesting how the image series tend to gravitate toward mushrooms
You can really "cheat" by modifying the custom prompt to re-insert or remove specific features. For example, "generate a prompt for this image but adjust it by making everything appear in a more primitive, earlier evolutionary form, or in an earlier less developed way" would make things de-evolve.
Or you can just re-insert any theme or recurring characters you like at that stage.
One reason this is good is that the default gpt4-vision UI is so insanely bad and slow. This just lets you use your capacity faster.
Rate limits are really low by default - you can get hit by 5 img/min limits, or 100 RPD (requests per day) which I think is actually implemented as requests per hour.
This page has info on the rate limits: https://platform.openai.com/docs/guides/rate-limits/usage-ti...
Basically, you have to have paid X amount to get into a new usage tier. Rate limits for dalle3/images don't go up very fast, but it can't hurt to get over the various hurdles ($5, $50, $100) as soon as possible for when limits come down. End of the month is coming soon. It looks like most of the "RPD" limits go away when you hit tier 2 (having paid at least $50 historically via the API).
It seemed that, after a few iterations, GPT-4 lost its cool and blurted out it thinks DALL-E generates ugly sweaters:
Create a cozy and warm Christmas scene with a diverse group of friends wearing colorful ugly sweaters.
A clever idea that I'd love to play around with, but not without a source link so I could feel better about trusting it and host it myself.
Interesting, how stable are the images for a given prompt? And the other way around? Does it trend toward some natural limit image/text where there are diminishing returns to making change to the data?
The endpoint of the evolution always seems to be a poster on the bedroom wall of a teenager who likes to smoke weed. I wonder why!
It goes against my intuition that many prompts are so stable.
Does anyone else experience a physical reaction to AI generated art that resembles repulsion and disgust? Something about it just feels “wrong”. Something I can compare it to is the feeling of unexpectedly seeing an extremely moldy thing in your fridge. It feels alive and invasive in an inhuman and horrifying way.
This is fun, thanks for sharing! It would be interesting to upload the initial image from a camera to see where the chain takes it.
I'd like to be able to begin it with an image rather than a prompt.
It would be interesting to add a constant modifier/amplifier to each cycle, like making each description more floral, robotic, favoring a certain style each time so we can trace the evolution, or perhaps having the prompt describe the previous image via a certain lens like "describe what was happening immediately before that led to this image"
It's quite fun to do these loops.
Here's the same thing done using Faktory.
If you were wondering how to bump up your API rate limits through usage, this is the way.
// also, it's the best way - TY @z991
strange to me how many of these eventually turn into steampunk.
This was the first thing I (and I presume many others) tried when GPT4-V was released, by copypasting between two ChatGPT windows. I've been waiting for someone to make an app out of it. Good job!
Science class with a dark twist: https://dalle.party/?party=ks3T2mMx
Is this a curious case of compression?
Here's a custom prompt that I enjoyed:
"Think hard about every single detail of the image, conceptualize it including the style, colors, and lighting.
Final step, condensing this into a single paragraph:
Very carefully, condense your thoughts using the most prominent features and extremely precise language into a single paragraph."
https://dalle.party/?party=1lSMniUP
https://dalle.party/?party=cEUyjzch
https://dalle.party/?party=14fnkTv-
https://dalle.party/?party=wstiY-Iw
Praise the Basilisk, I finally got rate-limited and can go to bed!