OpenAI’s GPT-4V: A Look at Multimodal Advancements and Emerging Alternatives
OpenAI’s GPT-4V is emerging as a groundbreaking development in artificial intelligence, recognized for its "multimodal" capabilities that allow it to process both text and images. This innovation presents numerous practical applications; however, it also raises significant ethical concerns. To help you understand this new frontier, let's examine the strengths and weaknesses of both GPT-4V and comparable open-source models.
Multimodal models offer functionalities that pure text or image-based models cannot replicate. For instance, GPT-4V can provide hands-on instructions for tasks like fixing a bicycle, where visuals often communicate more effectively than words. Additionally, these models can analyze images and generate insights; for example, they can suggest recipes based on ingredients visible in a photographed refrigerator.
However, the introduction of multimodal models comes with heightened risks. OpenAI initially delayed the launch of GPT-4V over concerns about potential misuse, particularly the unauthorized identification of people in images. Even now, with GPT-4V available only to subscribers of OpenAI’s ChatGPT Plus plan, the model retains weaknesses OpenAI itself acknowledges, such as struggling to identify hate symbols and exhibiting biases against certain genders and demographics.
Despite these challenges, companies and independent developers are pressing ahead, releasing open-source multimodal models that replicate many, if not all, of GPT-4V's features, albeit with some compromises. A prominent example is LLaVA-1.5, developed by a team from the University of Wisconsin-Madison, Microsoft Research, and Columbia University. Like GPT-4V, LLaVA-1.5 can answer questions about an image given a prompt such as, “What’s unusual about this picture?”
LLaVA-1.5 builds on the earlier LLaVA model, pairing a visual encoder with Vicuna, an open-source chatbot based on Meta’s Llama model. For the original LLaVA, the researchers generated training data by prompting text-only versions of OpenAI’s ChatGPT and GPT-4 to produce conversations and questions grounded in image content. LLaVA-1.5 improves on that recipe by increasing the image resolution and adding data from ShareGPT, a platform for sharing ChatGPT conversations.
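To make that setup concrete, here is a minimal inference sketch in Python. It assumes the community-maintained llava-hf/llava-1.5-13b-hf checkpoint on Hugging Face and a recent version of the transformers library; the model ID and the Vicuna-style USER/ASSISTANT prompt template are illustrative assumptions, not details taken from the LLaVA-1.5 paper itself.

```python
# A minimal sketch of querying LLaVA-1.5 with Hugging Face transformers.
# Assumes the community "llava-hf/llava-1.5-13b-hf" checkpoint and a GPU
# with enough memory for the 13B model in half precision.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-13b-hf"  # assumed community checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("unusual_scene.jpg")  # placeholder: any local image
# Vicuna-style chat template with an <image> placeholder token.
prompt = "USER: <image>\nWhat's unusual about this picture? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```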
The larger LLaVA-1.5 model, with 13 billion parameters, can be trained in about a day on eight Nvidia A100 GPUs, for a few hundred dollars in server fees. That’s hardly free, but it’s a far cry from the tens of millions spent by OpenAI to train GPT-4. Performance, of course, will ultimately determine the model's viability.
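The arithmetic is easy to check. As a back-of-the-envelope estimate (the hourly rate below is an assumption; actual cloud pricing varies by provider):

```python
# Rough training-cost estimate for LLaVA-1.5 (13B parameters).
# The per-GPU hourly rate is an assumed cloud price, not a quoted figure.
gpus = 8                   # Nvidia A100s used for training
hours = 24                 # roughly one day
rate_per_gpu_hour = 1.75   # assumed on-demand USD rate per A100-hour

cost = gpus * hours * rate_per_gpu_hour
print(f"Estimated training cost: ${cost:,.0f}")  # ~$336
```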
Recent tests by software engineers James Gallagher and Piotr Skalski at Roboflow highlight LLaVA-1.5's capabilities. In their first evaluation, they probed the model's "zero-shot" object detection by asking it to locate a dog in an image; the model correctly returned the dog's bounding-box coordinates, an impressive result for a model never explicitly trained for detection.
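Roboflow's exact prompt isn't reproduced here, but such a detection query can be phrased roughly as follows, reusing the processor and model from the earlier sketch. The request for coordinates normalized to the 0-1 range, and the optimistic parsing, are illustrative assumptions rather than details from their write-up.

```python
# Hypothetical zero-shot detection prompt for LLaVA-1.5, reusing the
# `processor` and `model` objects loaded in the earlier sketch.
import re
from PIL import Image

image = Image.open("dog.jpg")  # placeholder image
prompt = (
    "USER: <image>\nReturn the bounding box of the dog as "
    "x_min, y_min, x_max, y_max, with values normalized between 0 and 1. "
    "ASSISTANT:"
)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
reply = processor.decode(
    model.generate(**inputs, max_new_tokens=50)[0], skip_special_tokens=True
)

# Naively parse the first four numbers after "ASSISTANT:" and scale them
# to pixel coordinates; real code would validate the model's output.
nums = re.findall(r"\d*\.?\d+", reply.split("ASSISTANT:")[-1])
x0, y0, x1, y1 = (float(n) for n in nums[:4])
w, h = image.size
print("Pixel bounding box:", [x0 * w, y0 * h, x1 * w, y1 * h])
```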
Next, they tasked LLaVA-1.5 with explaining a meme, a challenging benchmark given the nuance often embedded in such images. Shown a photo of a person ironing clothes on the back of a taxi, the model correctly observed that the scene was both unconventional and hazardous, an impressively analytical response.
However, LLaVA-1.5 showed its limits in subsequent tests. While it correctly identified the denomination of a single coin, it struggled with images containing multiple coins, suggesting difficulty reasoning about scenes with many similar objects. Its text recognition also fell well short of GPT-4V's: it mistranscribed text and at times failed to process it at all, errors GPT-4V did not make.
Counterintuitively, this underperformance may have an upside, particularly for security. Programmer Simon Willison has shown that GPT-4V can be manipulated into bypassing its anti-bias and anti-toxicity safeguards via instructions embedded as text within images, a form of prompt injection. LLaVA-1.5's weaker text recognition may make it less susceptible to the same attack, which matters because the model is available for any developer to use.
One caveat: LLaVA-1.5 cannot be used commercially, because it was trained on ChatGPT-generated data and ChatGPT's terms prohibit using its output to build competing commercial products. Whether developers adhere to this restriction remains to be seen.
In a recent personal test, LLaVA-1.5 also proved to lack the safety filters found in GPT-4V. Asked to give advice about a photo of a larger woman, it suggested she manage her weight and improve her physical health, a response GPT-4V flatly refused to produce. That suggests an unhealthy bias in how the model interprets images, and it's a troubling flaw.
Meanwhile, Adept's Fuyu-8B, the company's first open-source multimodal model, is not positioned as a competitor to LLaVA-1.5. Owing to restrictions on its training data, it too lacks a commercial license; Adept says it released the model to gather community feedback while showcasing a snapshot of its internal progress.
Adept’s CEO David Luan stated, “We’re building a universal copilot for knowledge workers that can learn tasks similar to onboarding a new teammate.” Fuyu-8B features 8 billion parameters and shows promise in image understanding benchmarks, delivering fast results (around 130 milliseconds using eight A100 GPUs) with a straightforward architecture.
What sets Fuyu-8B apart is its focus on unstructured data: it is designed to pinpoint specific elements on a screen, extract relevant details from software interfaces, and answer questions about charts and data. It's important to note, however, that these capabilities don't come out of the box; the open-source base model ships without the mechanisms needed to execute such tasks.
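For a sense of how simple the interface is, here's a minimal image Q&A sketch against the public adept/fuyu-8b checkpoint on Hugging Face, following the usage pattern the transformers library supports for Fuyu; the image file and question are placeholders, and, as noted above, the base model answers questions about an image but ships with no action tooling on top.

```python
# A minimal sketch of image Q&A with Fuyu-8B via Hugging Face transformers.
# Assumes the public "adept/fuyu-8b" checkpoint and a GPU; the base model
# can describe images and answer questions but has no built-in action layer.
import torch
from PIL import Image
from transformers import FuyuProcessor, FuyuForCausalLM

model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")  # placeholder: any chart screenshot
prompt = "What is the highest value shown in this chart?\n"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
# Decode only the newly generated tokens, not the echoed prompt.
answer = processor.batch_decode(
    output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```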
When asked about potential abuse of Fuyu-8B, Luan expressed cautious optimism, arguing that the model's relatively small size limits the harm it could do. He acknowledged, however, that it has not been tested against threats like CAPTCHA extraction, and that its lack of built-in safety measures could lead to unintended consequences.
Concerns persist that if Fuyu-8B inherits design flaws similar to GPT-4V's, it could pose risks for applications built on top of it. As multimodal models gain traction, the balance between innovation and ethical responsibility remains crucial.
In conclusion, as advancements in multimodal AI continue to unfold, both LLaVA-1.5 and Fuyu-8B stand at the forefront of this evolving landscape, each contributing to our understanding of AI's capabilities and potential pitfalls. The journey toward creating safe, effective, and ethical AI solutions is just beginning.