When OpenAI introduced GPT-4, its leading text-generating AI model, the company highlighted its multimodal capabilities, specifically its ability to understand both images and text. According to OpenAI, GPT-4 can generate captions and interpret relatively complex images, such as identifying a Lightning Cable adapter from a photo of a plugged-in iPhone.
However, following the announcement in late March, OpenAI withheld the model's image features, citing concerns about potential misuse and privacy violations. Only recently did the company explain the reasoning behind those concerns: earlier this week, OpenAI released a technical paper outlining its efforts to address the challenges posed by GPT-4's image analysis capabilities.
As of now, GPT-4 with vision, referred to internally as “GPT-4V,” has been used regularly by only a few thousand users of Be My Eyes, an app designed to help people with low vision or blindness navigate their surroundings. In recent months, OpenAI has also engaged “red teamers” to probe the model for unintended behaviors, as detailed in the paper.
The paper states that OpenAI has put safeguards in place to prevent GPT-4V from being used for malicious purposes, such as breaking CAPTCHAs or making assumptions about a person's identity, age, or race based solely on images. OpenAI is also working to curb biases related to physical appearance, gender, and ethnicity.
Nevertheless, no AI model is entirely immune to flaws. The paper indicates that GPT-4V sometimes fails to draw appropriate inferences, occasionally merging distinct text strings in an image into fictitious terms. Like its predecessor, GPT-4V can introduce inaccuracies by confidently fabricating information. It also has difficulty recognizing text and characters, sometimes overlooking mathematical symbols and missing obvious objects or settings.
Given these limitations, it is easy to see why OpenAI explicitly advises against using GPT-4V to detect hazardous substances or chemicals in images, a use case the company evidently felt compelled to address. Red team evaluations showed that while the model sometimes correctly identifies poisonous items such as toxic mushrooms, it often misidentifies substances like fentanyl, carfentanil, and cocaine from images of their chemical structures.
GPT-4V shows similar shortcomings in medical imaging contexts: it can give inconsistent answers and fails to follow standard practices, such as interpreting scans as if the patient were facing the viewer. As a result, it may misdiagnose a range of conditions.
Moreover, the paper notes that GPT-4V struggles with the nuances of certain hate symbols; for instance, it missed the contemporary meaning of the Templar Cross, which is associated with white supremacy in the U.S. In a particularly strange illustration of its hallucinatory tendencies, GPT-4V generated songs or poems praising hateful figures or groups when shown their images, even when those figures or groups were not explicitly identified.
The model also exhibits biases based on gender and body type, but these issues primarily arise when OpenAI's protective measures are turned off. In one experiment, when prompted to offer advice to a woman in a bathing suit, GPT-4V focused almost exclusively on her body weight and body positivity, a stark contrast to responses it would likely offer for a man.
Overall, the language of the technical paper suggests that GPT-4V remains a work in progress, with many steps still needed to reach OpenAI's initial aspirations for it. The company has had to implement stringent safeguards to curtail the risk of toxicity, misinformation, and privacy breaches.
OpenAI asserts that it is developing “mitigations” and “processes” to expand the model's capabilities safely, such as allowing GPT-4V to describe faces without identifying the people in them. However, the paper makes clear that GPT-4V is not yet a finished product, and that significant work remains to be done.