Reports indicate that OpenAI is racing to launch its multimodal large language model, GPT-Vision, before Google introduces its own highly anticipated multimodal model, Gemini. Following that release, OpenAI may go on to announce a more advanced multimodal model called Gobi.
Earlier this year, OpenAI introduced GPT-4, which includes some multimodal capabilities. Unlike its predecessor, GPT-3.5, which accepted only text inputs, GPT-4 can also process images, although these visual features are not yet publicly available. Gobi, by contrast, is reportedly being designed as a multimodal model from the ground up, intended to handle a wider range of input types more effectively.
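For a concrete picture of what image-plus-text prompting looks like in practice, here is a minimal sketch using OpenAI's Python SDK. The model name, its availability, and the example image URL are illustrative assumptions rather than details confirmed in the reporting above.

```python
# Minimal sketch of a text-plus-image request via OpenAI's chat completions API.
# The model name "gpt-4-vision-preview", its availability, and the image URL
# are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # hypothetical vision-enabled model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is shown in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The key difference from a text-only request is that the user message carries a list of content parts, mixing text with one or more image references, rather than a single string.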
Both OpenAI and Google are integrating multimodal features into their language models, allowing text, images, audio, and other data types to be combined in a single interaction. This integration aims to improve accuracy and enhance the overall user experience. The competition between OpenAI and Google to build these models mirrors the rivalry between Apple's iOS and Google's Android, driving innovation and shaping the future of AI development.
Gobi vs. Gemini: The Race for Multimodal Language Model Supremacy
Reports suggest that Google is poised to unveil Gemini, having already shared project details with select external companies. Meanwhile, OpenAI is working to extend GPT-4 with broader multimodal capabilities in a bid to stay ahead of Google. Although OpenAI previewed some of GPT-4's multimodal features earlier this year, training for Gobi has reportedly not yet begun, and it remains unclear whether Gobi will ultimately become GPT-5.
Google holds a unique advantage due to its access to proprietary data from platforms like Google Search and YouTube, encompassing text, images, audio, and video. Users familiar with early versions of Gemini report that it delivers more accurate responses than existing language models.
Addressing Information Security in Multimodal Functions
When OpenAI unveiled GPT-4's multimodal capabilities in March, it initially restricted access to select partners, such as Be My Eyes, which assists visually impaired users. OpenAI is now preparing to roll out GPT-Vision more broadly. According to reports, the delay stems from concerns about potential misuse of the new visual features, such as automated CAPTCHA solving and facial recognition surveillance, and OpenAI engineers are working on ways to mitigate these risks.
Google's Gemini faces similar hurdles. When asked about safeguards against misuse, a Google spokesperson referenced commitments made in July to ensure responsible AI development across its products.
Conclusion: The Emerging Focus on Multimodal AI Models
The integration of multimodal features into large language models is set to significantly improve analytical accuracy. OpenAI, best known for ChatGPT, and established tech giant Google are both pushing ahead on multimodal capabilities, underscoring a key trend in AI's evolution. The competition reflects a broader technological contest that will likely spark important global discussions about applications, collaborations, regulation, and the ethical considerations surrounding this technology. The upcoming releases of Gobi and Gemini will offer the first real measure of this rivalry and help shape the future of AI development.