Playing Telephone with Image AI

I’m sure you are familiar with the game of telephone. Someone to your left whispers a message in your ear. You need to internalize and memorize it enough to whisper it in the ear of someone on your right. When you have a chain of enough people, hilarity ensues. Each step mutates the text slightly. The message at the end of the chain is rarely anything like the starting message. There’s also a variant, Eat Poop You Cat, that borrows from Pictionary — alternating between images and image captions.

A fun Twitter account from a few years ago was @picdescbot. It took random pictures from Wikimedia, fed them through Microsoft’s Computer Vision Cognitive Services, and posted the image and resulting caption to Twitter, sometimes with hilarious results.

A fun thing people are doing today is plugging textual descriptions into AI image generators like Dall-E, Midjourney, and Craiyon.

I thought: why not take all three of these things (turning pictures into text, turning text into pictures, and the game of telephone) and put them together to see what arises? This blog post is strictly about a single cycle: from image to text to image again. The weak link (as you are about to see) is the AI-generated captioning. In my (admittedly brief) experiments, I found that extending the chain results in ever-weakening captions and increasingly bland pictures. You’re also forced to exercise editorial control/intervention by picking one of the several generated images, which can artificially steer things in certain directions. I never achieved a drastic or surprising “eat poop you cat” moment with a long chain. We’re sticking with a single set of transforms here, but you’re welcome to try your own experiments.

Microsoft’s Demo & My Methodology

All of Microsoft’s code samples pointed to this one picture of Colosseum ruins. Of all the generated photo captions, this one had the richest description. Their computer vision API pulls out more metadata than just a single one-sentence description. I wrote a small Python program to dump the interesting fields. The results look something like this:

{
    "category": [
        "building_",
        "others_",
        "outdoor_"
    ],
    "objects": [],
    "description": "an ancient city with many ruins with Colosseum in the background",
    "tags": [
        "outdoor",
        "building",
        "sky",
        "stadium",
        "ancient rome",
        "ruins",
        "amphitheatre",
        "ancient roman architecture",
        "historic site",
        "ancient history",
        "history",
        "archaeological site",
        "travel",
        "large",
        "city"
    ]
}
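
A minimal sketch of how a summary like the one above might be pulled out of a raw analysis result. This is not the program used for the post. It assumes the response is a dictionary shaped like the Computer Vision v3.2 Analyze Image JSON (with `categories`, `objects`, `description.captions`, and `tags` sections); the function name `summarize_analysis` and the sample values are made up for illustration.

```python
import json


def summarize_analysis(analysis: dict) -> dict:
    """Reduce a raw analysis response to the four fields shown above.

    Assumption: `analysis` follows the v3.2 Analyze Image response layout.
    Missing sections are treated as empty rather than raising.
    """
    captions = analysis.get("description", {}).get("captions", [])
    # Keep the caption the service is most confident about, if there is one.
    best = max(captions, key=lambda c: c.get("confidence", 0.0), default=None)
    return {
        "category": [c["name"] for c in analysis.get("categories", [])],
        "objects": [o["object"] for o in analysis.get("objects", [])],
        "description": best["text"] if best else "",
        "tags": [t["name"] for t in analysis.get("tags", [])],
    }


# Hand-written stand-in for a response, NOT real API output.
sample = {
    "categories": [{"name": "building_", "score": 0.5}],
    "objects": [],
    "description": {
        "captions": [{"text": "a large brick building", "confidence": 0.5}]
    },
    "tags": [{"name": "outdoor", "confidence": 0.9}],
}

print(json.dumps(summarize_analysis(sample), indent=4))
```

Working on a plain dictionary keeps the sketch independent of any particular SDK or HTTP client, so the same function would apply however the response was fetched.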

In many cases the description alone wasn’t enough to generate an interesting image. I’d end up with something like “a person in a garment” as the main caption, with all the interesting things happening in objects and tags. I wrote a second program to help with cleaning up the data. I turned it into a single line of text of the format: “description; unique list of objects and tags.” In this case, “an ancient city with many ruins with Colosseum in the background; outdoor, building, sky, stadium, ancient rome, ruins, amphitheatre, ancient roman architecture, historic site, ancient history, history, archaeological site, travel, large, city.” Let’s run that long quote through some AI image-generation applications:
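
The clean-up step described above can be sketched roughly as follows. This is an illustration rather than the program used for the post; `build_prompt` is a made-up name, and it assumes a summary dictionary with the `description`, `objects`, and `tags` keys shown earlier. Judging from the captions later in the post, where a word like “person” can appear twice, objects and tags seem to be de-duplicated separately and then joined, so the sketch does the same.

```python
def build_prompt(summary: dict) -> str:
    """Join a caption and its keywords as "description; objects, tags"."""
    # dict.fromkeys drops repeats while preserving first-seen order.
    objects = list(dict.fromkeys(summary.get("objects", [])))
    tags = list(dict.fromkeys(summary.get("tags", [])))
    keywords = ", ".join(objects + tags)
    description = summary.get("description", "").strip()
    # Fall back to the bare description when there are no keywords.
    return f"{description}; {keywords}" if keywords else description


# The Colosseum summary from the JSON shown earlier in the post.
colosseum = {
    "objects": [],
    "description": "an ancient city with many ruins with Colosseum in the background",
    "tags": [
        "outdoor", "building", "sky", "stadium", "ancient rome", "ruins",
        "amphitheatre", "ancient roman architecture", "historic site",
        "ancient history", "history", "archaeological site", "travel",
        "large", "city",
    ],
}

print(build_prompt(colosseum))
```

Some generators reject certain words (the post mentions having to delete “weapon” for Dall-E), so a real pipeline would probably still want a manual editing pass after this step.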

In order (and the rest of this post will be in this same order), the results of that long line of text are:

Dall-E : This is definitely something plausible, but there are a lot more trees and green than I expected. This is likely because the actual Colosseum has a good amount of greenery around it.
Midjourney : Epic, but also blurry and distorted.
Craiyon : Maybe a good balance between the two? Some are really badly distorted, but a couple mirror some of the original looks and color palette.

The following test runs fall into three categories: weird and unusual photos, fine art, and memes. Feel free to skim until you find something good.

Weird and Unusual Photos

Brick Building

I run a bot, Adjective Noun Bot (blog post, GitHub), that posts to Twitter and Mastodon. In short, it takes a pairing of adjective and noun, performs a safe web image search, and posts the result. It’s a good source of occasional off-the-wall pictures. Instead of starting with a truly weird one, I thought I’d try a relatively normal photo that the bot picked up.

“A large brick building with a ladder; dormer window, plant, tree, house, window, building, outdoor, tree, sky, winter, house, plant, snow, home.” I was worried that “ladder” doesn’t accurately convey the fire escape.

Dall-E : I was pleasantly surprised that one of the ladders turned into a fire escape. In a general hand-wavy sense, fairly close to the original picture.
Midjourney : While I do appreciate the windowless tower and the houses made of clouds and snow, I can’t say any of these look quite real.
Craiyon : Are we sure we know what shape a house takes?

Mushroom Cloud

I have no idea what is going on here and I’m totally fine with that.

“a man standing in front of a large explosion; person, weapon, hat, clothing, outdoor, sky, person, black and white”

All were very good representations. I had to delete “weapon” from the list for Dall-E. One of the Midjourney images was clearly not black-and-white. And does that mushroom cloud double as a hat?

Man and Baby at a Picnic

“A person lying on a baby; person, outdoor, human face, clothing, baby, boy, toddler, person, girl, laying, plant, grass, summer, child.” I was really hoping for more from “ON A baby,” but alas, no. Also: Microsoft appears to be having issues assigning a binary gender, saying the child is both a boy and a girl. Let them decide when they’re older.

All of the AIs focused on the baby and left out the man. Midjourney seems to have made a forest-nymph baby covered in patches of grass. Craiyon has created nightmare babies that will relentlessly haunt us from the cabbage patch.

Demon

Speaking of nightmares — how about a little demon? Or at least, a demon mask? Does Kiss get some kind of royalties for this?

“A person in a garment; fictional character, anime, cartoon, fiction, mask, cg artwork, horror, animation, hero, masque, illustration, batman, digital compositing.” Apparently Batman has slid into my keywords. This should be interesting. Although Dall-E is sensitive to proper names, both real and fictional, I did not have to alter or remove “batman” from the list.

Dall-E : I do like how it captured some of the demonic aspects of the original photo.
Midjourney : Artistically, I really like these. The white mask (1), the empty eyes (3), and the mask-on-a-mask (4).
Craiyon : This is the gritty reboot of 60s Batman that nobody asked for. Kapow!

Demon Animal

In my head, this was a cat, but Microsoft differed with me. “A black and white drawing of a dog with long hair; mammal, sketch, drawing, painting, animal, art, mammal, illustration, black and white.” None of this captures the dark, demonic nature. We’re just going to end up with dog sketches, right?

Dall-E : Dog sketches.
Midjourney : Dog ske— what the hell is going on with that two-snouted dog in square (3)???
Craiyon : Dog sketches.

Laboratory

“A person in a garment; person, clothing, person, black and white, white, black, monochrome, wearing, standing, sunglasses.” This is all fairly generic and doesn’t really capture the head-wrap, the partially occluded “laboratory,” the locks, the goggles, or the gloves. We’ll likely end up with beachgoers and FBI agents.

All fashion models, all the time.

Thug & Grandma

“A man and woman wearing clothing; person, Glasses, person, clothing, human face, indoor, wall, holding, glasses.” Well, that sounds pleasant. Just two people wearing clothes. No mention of balaclava, no mention of pistol. They’re not even naked, they’re wearing clothing!

Dall-E : Relatively straightforward portraits of a couple.
Midjourney : If we ignore the headless man and the lady whose head is on backward, we have relatively straightforward illustrations of a couple.
Craiyon : Hello fellow normal glasses-wearing humans. We are normal humans with normal glasses just like you are. Take us to your leader.

Toxic

(Cue the Britney Spears music.)

“Icon; stop sign.” Well, this isn’t going to go well. At least we’ll get some nice stop signs. Those are relatively normal octagonal things in the US, and a different-shaped kind of normal thing elsewhere in the world.

Dall-E : Yep, these are different takes on stop signs. Different fonts, different visual ratios. I kind of like the rounded-octagon.
Midjourney : whatisthisidonteven. We have this general red octagon/hexagon theme. Mostly-red gons. I like that. I can’t tell if those are upside-down ¡¡¡exclamation points!!! or people. I kind of like the lightning bolt from a god (maybe Thor?) that will smite you if you don’t stop. The painting is serene, but could almost be a fine-art take on a futuristic Arby’s sign.
Craiyon : A for effort. The first one and middle one are almost there. I don’t want to have to drive in whatever country Craiyon thinks it is living in.

None of these are toxic.

Confederates

I was a liiiiittle worried that these Confederate soldiers might uncover some unconscious bias in the AI’s training, for both the captioner and the image generation. “A group of men on horses; horse, person, outdoor, rein, mare, tree, horse supplies, stallion, bridle, halter, horse tack, clothing, mammal, person, mane, animal, horse, cowboy.” Microsoft distilled them to something downright wholesome. Some cowboys and horses, a tree. Maybe asking for a fence was a bridge too far.

Dall-E : Photos of cowboys, horses.
Midjourney : Illustrations of cowboys, horses.
Craiyon : Wait a minute. That first and that last picture? Are those horses with a butt at each end and a horse with a head at each end? I don’t really want to know what sort of biology or wormhole physics is involved in how they eat and poop.

Fine Art

Night Watch

The previous painting, for whatever reason, gave me some Night Watch vibes. Let’s give that one a try. “A group of people in clothing; person, painting, art, person, clothing, visual arts, people, group.” You can’t possibly get more generic. They’re even wearing clothing! Let’s give this a try.

Let’s just throw “night” right out the window. Hey, while we’re at it, let’s throw “watch” out, too. What you’re left with is bland groups of people wearing bland clothing.

Tower of Babel

The Tower of Babel isn’t too different from the Microsoft example. Let’s give that a try. “A painting of a building on a cliff; painting, drawing, sky, cloud, art, building, outdoor, castle.”

The collective subconscious that went into the training models seems to think cliffside castles are along waterfronts and have lots of airy sky and clouds. Admittedly, the caption didn’t include that it’s bordering a city and the cliffs are only a minor part of the overall tableau.

La Grande Jatte

Anyone who’s seen Ferris Bueller knows this Seurat painting. And yes, even Microsoft does a decent job of describing it! “A painting of a family in a park; mammal, person, painting, person, drawing, clothing, art, grass, outdoor.”

Dall-E : Love it, although in retrospect, maybe “mammal” was a bit too literal of a keyword.
Midjourney : Quite nice. I like the three paintings more than the illustration, but that’s personal preference.
Craiyon : I made you a drawring and I used all my craiyons. Adorable in its own way, but a far departure from the source material.

American Gothic

This should be easy, right? A man and a woman. Check. They’re wearing clothes. Check. “A man and woman posing for a picture; person, Glasses, clothing, human face, person, painting, vestment, building, man.” Pitchforks are overrated. Background architecture is over, man!

Welp. You get out what you put in. The Craiyon people are truly terrifying.

Persistence of Memory

Shifting to the surreal, let’s try this Dali. I’ll just paste in the summary: “A video game of a robot; painting, drawing, sky, art, child art, illustration, cartoon, outdoor, beach.” I’d play that game. It sounds adorable. Can you imagine if Dali made video games???

Dall-E : Adorbs.
Midjourney : Two points from Ravenclaw. Not enough row-bot!
Craiyon : I love your adorable little craiyon drawings of robots.

Son of Man

Let’s try a slightly more normal, or at least describable, surrealist. “A man wearing a suit and hat; person, tie, clothing, sky, coat, outerwear, person, fashion accessory, collar, man, cloud, sleeve, wearing, hat, outdoor, suit.” Welp. No apple or other fruit. This is going to be a challenge.

Dall-E : All of these nailed the suit, hat, and sky background.
Midjourney : Ignoring the invisible man and Mr. Neck, these also have good composition that captures a lot of the original. Heck, maybe the invisible man could have been a Magritte trope in an alternate universe.
Craiyon : A for effort. Drawing people is hard.

Whistler’s Mother

Nobody knows what Whistler looks like, but everyone’s familiar with his mother. “A person in a black dress; person, painting, clothing, picture frame, indoor, wall, person, art, woman, sitting”

Dall-E : Generally, Dall-E has been the clear winner, but I find all of these quite bland.
Midjourney : These are good illustrations, but only two of them truly capture the “sitting” keyword.
Craiyon : I know that hoomans are a very difficult thing to draw, especially in craiyon, but I really love the compositions with the lady sitting in or climbing through the empty frames.

Memes

Dino on a Surfboard

Let’s rewind to where we started. The so-bad-it’s-good description, surfaced from PicDescBot, of “a dinosaur on a surfboard.” I found the tweet, right-clicked the image, and loaded it into the captioner.

Microsoft has since improved its image processing and is now ignoring the surfboard. “A black and white drawing of a dinosaur; dinosaur, animal, mammal, sketch, drawing, illustration.” Friends, I need to level with you. I cheated. I literally used “a dinosaur on a surfboard.” How could I not?

These are all beautiful in their own respective ways. It would have been nice to have a little more photorealism, but I’m not going to complain. Too much.

Some Ape

Speaking of right-clicking. I right-clicked on something else. I think I might go to jail for it because somebody on some blockchain somewhere has a receipt for this picture. And that receipt is as binding as my love for dinosaurs on surfboards.

“A person wearing sunglasses; person, cartoon, clipart, illustration, animated cartoon, sketch, drawing, animation, art, design.” I’m going to have to stop you at “person,” Microsoft.

All perfectly fine caricature illustrations of people wearing sunglasses. We’re far enough down the blog post that if we manually intervene and change “person” to “ape,” nobody will notice, right?

See? These are all perfectly reasonable representations of apes wearing sunglasses. Might be cooler with 3D glasses, but I’m not going to complain. Who wants to turn these into NFTs?

Spiderman Pointing

Let’s conclude this with Spiderman pointing. Full disclosure: I can never remember if spider-man has the hyphen or bat-man has the hyphen. “Diagram; person, fictional character, fiction, superhero, hero, illustration, cartoon.” I’m sure that captures the essence. Diagram. Nothing about the pointing or that it’s two characters.

Dall-E : Diagrams, check.
Midjourney : Superhero, check. I kind of dig the art-deco stylings.
Craiyon : You just keep doing what you’re doing, super-friend.

Conclusion

I don’t really know what to conclude here. Most of the chains fell apart at the describe step. I don’t blame anyone, since captioning images is a hard algorithmic problem to solve. The only alternative that I’m aware of is AWS Rekognition. It’s decent for identifying objects in photos. I wrote about it back in 2016, asking Rekognition to analyze still frames from movies. It’s improved over the years, no longer identifying everything green as a potted plant, but the Azure services edge it out in a few places. Having an overall summary is nice. Identifying meta things about the picture (photo vs painting vs illustration) is especially nice when trying to recreate an image.

Mostly, though: have fun. Poke around at these systems. Look for boundaries and explore new creative spaces. Although they do a lot of the heavy lifting, remember that these are just tools, like a crane or a dump truck. Use them to make cool things. And … I don’t know, make infinite variations on those things and sell them to suckers as NFTs or something. Wait, no. Don’t do that at all.
