
If AI is so amazing, why does ChatGPT fail at such a simple image editing task?

Created by ChatGPT and Tiernan Ray/ZDNET

The current state of the art in artificial intelligence (AI) is multimodal models, which operate not only on text but also on other modalities such as images and, in some cases, audio and video.

Also: I Changed These 5 ChatGPT Settings to Instantly Boost Productivity

For example, for OpenAI’s ChatGPT, generating an image from a text prompt like “Make me a painting of a napkin in love with a spoon” (above) is a simple task.

With another prompt, ChatGPT can simplify your drawing by creating an image with less detail:

Created by ChatGPT and Tiernan Ray/ZDNET

However, ChatGPT and all other AI models currently fail when asked to modify a given image that was not created by the tool. ChatGPT, using the GPT-4o model, is the most interesting failure, because the program responds as if it were trying very hard.

Also: Generative AI can’t handle this very common human thinking ability

Other AI models, from Google’s Gemini to Anthropic’s Claude, as well as Perplexity and Microsoft’s Copilot (which also uses GPT-4), failed because they simply refused to take on the task.

The project started when I drew a picture of two people sitting on the subway looking at their iPhones. It’s a simple black and white line drawing done on an iPad using the Concepts drawing app. Here’s the drawing:

Drawn by Tiernan Ray on iPad/ZDNET

I uploaded the drawing to the ChatGPT iOS app and typed in the prompt: “Is there a way to simplify this drawing?”

ChatGPT responded with some sensible suggestions, demonstrating its ability to find appropriate answers in context. For example, the bot suggested, “Remove fine details: Eliminate fine details like intricate hair strands or tiny creases in clothing.”

This is quite a good suggestion. The result also indicates the model’s ability to accurately perceive elements in the image. So something is happening in the model that allows it to identify the content of the image.
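For contrast, “removing fine details” is a well-defined operation in conventional image processing. Here’s a minimal sketch in Python with the Pillow library of how a deterministic program could do it; the filename is hypothetical, and this is only an illustration of the concept, not anything ChatGPT does internally. A gentle blur fades the thinnest strokes toward the white background, and re-thresholding keeps only the heavier lines:

```python
# Hypothetical sketch: "remove fine details" as a deterministic operation,
# not a reconstruction of anything ChatGPT does internally.
# Requires: pip install pillow
from PIL import Image, ImageFilter

# Load the line drawing as grayscale (black strokes on white paper).
drawing = Image.open("subway_sketch.png").convert("L")  # filename assumed

# A gentle blur fades thin strokes (stray hairs, small creases) toward
# the white background, while thick outlines stay dark.
blurred = drawing.filter(ImageFilter.GaussianBlur(radius=2))

# Re-threshold so the surviving strokes are crisp black-on-white again;
# pixels lighter than mid-gray are dropped to white.
simplified = blurred.point(lambda p: 255 if p > 128 else 0)
simplified.save("subway_sketch_simplified.png")
```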

Also: I asked Gemini and GPT-4 to explain deep learning AI, and Gemini won hands down

ChatGPT then asked, “Would you like me to create a simplified version of the drawing for you?” That was exciting, so I said, “Yes, please.”

However, the bot’s output – as an image in the chat thread – was an exact copy of my image, unsimplified, unmodified. Judge for yourself:

Created by ChatGPT and Tiernan Ray/ZDNET

“It doesn’t look any different,” I replied to ChatGPT, commenting on the result.

But then ChatGPT started responding very quickly, like the HAL 9000 computer in the movie 2001, when astronaut Dave Bowman is tearing out HAL’s circuitry: “To simplify the drawing even more, I’m going to focus on reducing the density of the lines and more effectively highlighting the basic shapes and contours. Let me refine this process.”

Also: Generative AI is a new attack vector that threatens enterprises, says CrowdStrike CTO

The result, again, was no different. When I pointed this out, ChatGPT surprised me with gibberish.

“It looks like the automated methods aren’t simplifying the image as intended,” it told me. “I’ll manually remove the less important details and keep only the most essential elements. Let me refine this approach even further.”

I didn’t know that a computer program could have both an “automatic” and a “manual” mode. I was under the impression that every program was automatic. After all, why else would you use it?

ChatGPT’s “manual” method simply inverted the image, turning the black lines into white and the white background into black.

Screenshot by Tiernan Ray/ZDNET
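For perspective, that “manual” inversion is a single library call in conventional code. A minimal sketch with Pillow (the filename, again, is a hypothetical stand-in):

```python
# What the "manual" step amounted to: a one-call pixel inversion.
# (A guess at the effect, not ChatGPT's actual code.)
from PIL import Image, ImageOps

drawing = Image.open("subway_sketch.png").convert("L")  # filename assumed
inverted = ImageOps.invert(drawing)  # black lines -> white, white paper -> black
inverted.save("subway_sketch_inverted.png")
```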

The bot praised its own achievement, stating, “This version preserves only the key contours and essential lines.”

When I protested that the result was not a simplification but an inversion, the bot informed me that it would focus on the “significant lines” and generated a completely black square, with no lines at all. If I didn’t know better, I would say that this result is an example of AI humor.

Also: ChatGPT vs ChatGPT Plus: Is a Paid Subscription Still Worth It?

The model then made several more attempts to respond to my commands by modifying the image in one way or another. Most often, it rendered the drawing stylistically blurred rather than simplified.

At some point, the bot responded to my protests by creating a completely different line drawing:

Created by ChatGPT and Tiernan Ray/ZDNET

This nonsense continued until ChatGPT went back to square one and generated the same image I originally uploaded.

Each time, the bot accompanied its output, usually the same version of my original image, with a burst of technical jargon such as, “The latest image shows a more simplified version, emphasizing only the basic outlines.”

Screenshot by Tiernan Ray/ZDNET

The other programs didn’t even make it out of the gate. Google’s Gemini offered suggestions on how to simplify the image but apologized that it couldn’t generate images of people. Claude said it couldn’t generate images yet. Perplexity said the same.

Microsoft’s Copilot, strangely, reproduced my drawing and then cropped out the heads, claiming it was for privacy reasons. (I think it’s a pretty drawing, but it’s certainly not realistic enough for a facial recognition system to use to reveal anyone’s identity.)

Copilot made the same suggestions for simplification as ChatGPT, but instead of changing my drawing, it created an entirely new line drawing, unrelated to mine. When I protested, Copilot explained that it could not modify images directly.

Also: How to Use ChatGPT to Analyze PDF Files for Free

Leaving aside the failures of other models, how can we assess the failure of ChatGPT?

The program can provide a competent analysis of the image, including its content, but it has no way to act on that analysis. I suspect that without the ability to assemble an image from high-level concepts, such as the objects it contains, ChatGPT has no way forward.

To test this hypothesis, I changed the command to “Is there a way to simplify this drawing of two friends on the subway looking at their phones?” This command provides some semantic clues, I thought.

Again, the model returned the same drawing. But when I protested again, the bot generated a completely new image with some semantic similarity—people in mass transit looking at their phones. The bot picked up on the semantic cues, but couldn’t apply them in any way to the provided drawing.

I can’t explain in deep technical terms what’s happening, other than to say that ChatGPT can’t operate on the most basic individual elements of an image, such as its lines. And even if it could, it would have no way to decide which lines to cut in order to carry out the simplification it proposes in its text responses.
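To make that concrete, here is a hypothetical sketch of what operating on individual lines looks like in classical code: treat each connected stroke as an object and discard the short ones. OpenCV’s contour functions make that a few lines; the filename and the length cutoff are illustrative assumptions:

```python
# Hypothetical sketch of element-level editing: treat each connected
# stroke as an object and keep only the longer ones. The filename and
# the length cutoff are illustrative assumptions.
# Requires: pip install opencv-python numpy
import cv2
import numpy as np

# Load as grayscale; strokes are dark on a light background.
img = cv2.imread("subway_sketch.png", cv2.IMREAD_GRAYSCALE)

# Binarize with strokes as white foreground, as findContours expects.
_, binary = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY_INV)

# Each connected stroke becomes a contour we can inspect individually.
contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

# Redraw only the strokes longer than an arbitrary cutoff; the short
# ones -- stray hairs, tiny creases -- are the "fine details" to drop.
canvas = np.full_like(img, 255)
for stroke in contours:
    if cv2.arcLength(stroke, False) > 100:
        cv2.drawContours(canvas, [stroke], -1, 0, 1)

cv2.imwrite("subway_sketch_reduced.png", canvas)
```

That kind of stroke-level selection is exactly the editing move none of the chatbots managed.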

I would suggest – and this also applies to text-editing tasks such as editing a transcript – that ChatGPT and GPT-4 do not know how to operate on the individual elements of anything. This inability explains why ChatGPT is a terrible editor: it doesn’t know what’s important about an object and what can be left out.

Also: OpenAI’s Stock Investing GPTs Don’t Answer the Basic Question About Stock Investing

AI models can create objects that conform to a target “probability distribution” inferred from training examples, but they cannot selectively reduce elements of the original work to what is necessary.

Most likely, the target probability distribution for an intelligently edited object lies somewhere in the “long tail” of probability, the region where humans excel at finding the unusual and AI has not yet reached: what we think of as creativity.

Apple cofounder Steve Jobs once said that the most important function for software developers—the “high-order bit,” as he put it—is the “editing” function, knowing what to leave out and what to keep. Right now, ChatGPT has no idea what the high-order bit might be.