Why 'Visual' AI Models May Not Actually Perceive Anything

The latest language models, such as GPT-4o and Gemini 1.5 Pro, are being hailed as “multimodal” technologies capable of processing images and audio alongside text. However, a recent study reveals that their understanding of these modalities may not be as nuanced as we expect. In fact, it raises questions about whether they truly "see" at all.

Despite bold marketing claims that these AI models possess advanced “vision capabilities” and “visual understanding,” there is a disconnect between what they actually do and what we ordinarily mean by sight. Their makers promote the systems’ ability to analyze images and video and to handle tasks ranging from math homework to sports analysis.

The underlying message from the developers is clear: they want us to believe that these models can “see” in a meaningful way. Yet what they actually do looks more like the way they handle calculations or storytelling: matching patterns in the input to patterns in their training data. As a result, these models often fail at seemingly simple visual tasks, much as they fail when asked to produce a genuinely random number.

Researchers from Auburn University and the University of Alberta took an informal but systematic look at the visual understanding of current multimodal AI models. They tested the models on basic visual tasks, such as deciding whether two shapes overlap, counting the pentagons in an image, or identifying which letter in a word has been circled (you can view a summary micropage here). These are tasks even young children would ace, yet the AI models struggled with them.

“Our seven tasks are so simple that humans achieve 100% accuracy. We expect AI to do the same, but they are currently NOT meeting that standard,” co-author Anh Nguyen stated in an email. “Our message is clear: these leading models are STILL struggling.”

Take the overlapping shapes test, for instance. This task simply involves identifying whether two circles overlap, touch, or are separated. While GPT-4o performed well with distant circles, achieving over 95% accuracy, it plummeted to just 18% success with overlapping or closely situated circles. Gemini 1.5 Pro performed better, but still only managed 70% accuracy at close distances.
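This kind of test is easy to approximate at home. The sketch below is my own rough reconstruction, not the researchers’ actual harness: it uses Pillow to draw two circle outlines a controllable distance apart and, assuming the OpenAI Python client and an API key, asks GPT-4o whether they overlap. The function names and prompt wording are illustrative, not taken from the paper.

```python
import base64
import io

from openai import OpenAI          # assumes the official OpenAI Python client
from PIL import Image, ImageDraw   # assumes Pillow for drawing the test image


def circle_pair_png(gap: int, radius: int = 60, size=(400, 200)) -> bytes:
    """Two circle outlines whose edges are `gap` px apart (negative gap = overlap)."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    cy = size[1] // 2
    x1 = size[0] // 2 - radius - gap // 2   # centre of the left circle
    x2 = x1 + 2 * radius + gap              # centre of the right circle
    for cx in (x1, x2):
        draw.ellipse([cx - radius, cy - radius, cx + radius, cy + radius],
                     outline="black", width=4)
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


def ask_if_overlapping(png: bytes, model: str = "gpt-4o") -> str:
    """Send the image to a multimodal model and ask the yes/no overlap question."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    b64 = base64.b64encode(png).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Do the two circles in this image overlap? Answer yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Sweep from clearly separated (+80 px) to clearly overlapping (-40 px).
    for gap in (80, 20, 0, -40):
        print(gap, ask_if_overlapping(circle_pair_png(gap)))
```

Sweeping the gap in small steps is what makes the failure mode visible: a person’s answer doesn’t change in character as the circles creep closer together, but the models’ accuracy does.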

Similarly, consider the challenge of counting interlocking circles in an image. The models performed perfectly when there were five interlinked rings, but when asked to count six, they struggled significantly. Gemini faltered completely, while Sonnet-3.5 answered correctly just one-third of the time, and GPT-4o managed just under 50%.
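The ring-counting stimuli can be mocked up the same way. Again, this is a simplified sketch of mine rather than the paper’s actual images: it chains N overlapping circle outlines in a single row (the real Olympic logo staggers its rings across two rows), so you can generate five-, six-, and seven-ring versions and put the same counting question to any multimodal model.

```python
from PIL import Image, ImageDraw  # assumes Pillow


def ring_chain(n: int, radius: int = 50, overlap: int = 30, margin: int = 20) -> Image.Image:
    """Draw `n` interlocking circle outlines in a horizontal chain."""
    step = 2 * radius - overlap                     # adjacent centres are this far apart
    width = 2 * (margin + radius) + (n - 1) * step
    img = Image.new("RGB", (width, 2 * (margin + radius)), "white")
    draw = ImageDraw.Draw(img)
    cy = margin + radius
    for i in range(n):
        cx = margin + radius + i * step
        draw.ellipse([cx - radius, cy - radius, cx + radius, cy + radius],
                     outline="black", width=4)
    return img


# Five rings resemble the ubiquitous Olympic logo; six or seven do not.
for n in (5, 6, 7):
    ring_chain(n).save(f"rings_{n}.png")
```

Pair these images with a prompt like “How many circles are in this image?” and the same query helper shown above, and you have, in miniature, the counting task the models stumbled over.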

Whatever these models are doing when they answer such questions, it doesn’t match our understanding of vision. Even where they show some capability, the inconsistency of their performance points to a lack of genuine visual comprehension.

One potential explanation for their uneven success is the specific training data these models have encountered. They likely handle five interlocking circles well because the Olympic rings feature prominently in their datasets. But where would they have encountered six or seven interlocking rings? Their failures there suggest that whatever they have learned about rings and overlaps does not generalize into a real concept of either.

I asked about the “blindness” the researchers attribute to the models. Nguyen said that while “blind” carries many meanings, even for humans, the word doesn’t quite capture this kind of insensitivity to the images the models are shown. There is currently no technology for visualizing exactly what an AI model perceives; its responses depend on many factors, including the input image, the prompt, and the model’s billions of parameters.

Nguyen speculated that while these models demonstrate a degree of visual awareness (for example, recognizing "there’s a circle on the left side"), they lack the capacity for visual judgment, resulting in interpretations akin to those of someone who has theoretical knowledge but no direct experience.

In one illustrative example, Nguyen described a blue circle overlapping a green circle, producing a shared cyan-shaded region like the middle of a Venn diagram. Someone who knows how colors mix could reasonably guess that the overlap is cyan even with their eyes closed, without ever seeing the image. That, he suggested, is closer to how these models answer: an informed guess rather than an observation, and one that goes wrong whenever the image doesn’t match expectations.

Does this mean that “visual” AI models are ineffective? Not at all. Still, their failure at such fundamental visual reasoning raises important questions about their broader capabilities. These models excel at recognizing human actions and expressions, everyday objects, and familiar situations, and that is precisely what they are intended to do.

If we solely relied on AI companies’ marketing, we might think these models possess perfect vision. Research like this serves to clarify that even when models accurately describe whether someone is sitting, walking, or running, they do so without “seeing” in the conventional sense of the word.
