How Competitive Is Multimodal AI at the End of 2023? Insights from Google’s Recent Developments
On December 6, Google launched Gemini, its natively multimodal model and a direct challenge to GPT-4. Shortly afterward, on December 14, the company introduced Imagen 2, a text-to-image model positioned as a strong competitor to DALL·E 3 and Midjourney.
Google is investing heavily in multimodal technology. Imagen 2 uses text-to-image diffusion techniques to generate high-quality, realistic images from simple natural-language prompts. The model also handles image comprehension, including visual question answering, which returns detailed answers about the elements within an image. In addition, it can interpret and visualize complex abstract concepts drawn from poetry and prose.
A significant enhancement in Imagen 2 is its ability to render realistic hands and facial features, an area where many AI art generators fall short. Its handling of light and detail is equally impressive. For instance, prompts like “A shot of a 32-year-old female conservationist in a jungle; athletic with short, curly hair and a warm smile” yield stunning visuals. Similarly, requests for images like “a French bulldog at the beach” are executed with remarkable finesse.
Imagen 2 can also capture the mood of abstract or literary text. For example, given a line from a Phillis Wheatley poem, "Streams murmuring, birds chirping, their mixed music wafts through the air," it produces an image that conveys the scene. It likewise generates evocative imagery from classic works such as "Moby Dick" and "The Secret Garden," which suggests it handles literary language well.
Imagen 2 also offers image-editing features: inpainting (generating new content within a region of an existing image) and outpainting (extending an image beyond its original borders). In addition to English, it supports six languages (Mandarin, Hindi, Japanese, Korean, Portuguese, and Spanish), with more planned for early 2024.
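To make the difference between inpainting and outpainting concrete, here is a minimal Python sketch of the binary masks such editing workflows commonly use. It is only an illustration under stated assumptions: the function names, the 1-means-regenerate convention, and the uniform padding are hypothetical, and nothing here reflects Imagen 2's actual interface or internals.

```python
# Illustrative only: hypothetical helpers, not part of Imagen 2 or Vertex AI.
# Assumed convention: 1 marks pixels the model should generate, 0 marks pixels to keep.

def inpaint_mask(width, height, box):
    """Mask for inpainting: regenerate only the region inside `box`.

    `box` is (left, top, right, bottom) in pixels, right/bottom exclusive.
    The canvas size stays the same as the original image.
    """
    left, top, right, bottom = box
    return [
        [1 if left <= x < right and top <= y < bottom else 0 for x in range(width)]
        for y in range(height)
    ]


def outpaint_mask(width, height, pad):
    """Mask for outpainting: enlarge the canvas by `pad` pixels on every side.

    The original image sits in the center (0 = keep); the new border
    around it is what the model would be asked to fill in (1 = generate).
    """
    new_width, new_height = width + 2 * pad, height + 2 * pad
    return [
        [
            0 if pad <= x < pad + width and pad <= y < pad + height else 1
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]


if __name__ == "__main__":
    for row in inpaint_mask(4, 3, (1, 1, 3, 2)):
        print(row)
    for row in outpaint_mask(2, 1, 1):
        print(row)
```

In this sketch, inpainting keeps the canvas size and marks an interior region for regeneration, while outpainting grows the canvas and marks only the newly added border.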
Google is also emphasizing Imagen 2’s marketing use cases, such as logo design and product advertising. The model can render specific text or phrases accurately within an image.
Safety is a central feature of Imagen 2. The model incorporates SynthID, which embeds an invisible digital watermark in generated images so that AI-generated content can be identified. It has also undergone rigorous data safety training and includes filters designed to prevent the creation of harmful content, such as violent or offensive material.
Currently, access to Imagen 2 is limited to a select group of Vertex AI customers. Vertex AI is Google Cloud's managed AI platform, on which developers train and deploy AI applications. It reflects Google's strategy of cultivating a developer-focused AI ecosystem centered on Google Cloud. Since generative AI was integrated into Vertex AI earlier this year, its user base has grown more than 15-fold.
As Google advances in the multimodal AI landscape, the implications for the industry are significant, paving the way for more sophisticated and accessible AI applications for businesses and creators alike.