Google's track record with image-generating AI has been rocky. Back in February, the image generator built into Gemini, Google's AI chatbot, erroneously injected gender and racial diversity into prompts about people, producing deeply offensive images, including racially diverse depictions of Nazis.
In response, Google pulled the generator offline and pledged to improve it before a re-launch. While we await that revamped tool, the company is rolling out upgrades to Imagen 2, the image-generating model in its Vertex AI developer platform, with a strong focus on enterprise applications.
Enhanced Image Creation with Imagen 2
Imagen 2, unveiled in December after being previewed at Google's I/O conference in May 2023, is a family of models that can generate and edit images from text prompts, along the lines of OpenAI's DALL-E and Midjourney. Geared toward business needs, Imagen 2 can render text, logos, and emblems in multiple languages and superimpose them onto existing images, making it suited to applications such as business cards and product branding.
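For developers, Imagen 2 is exposed through the Vertex AI SDK. Here's a minimal sketch of text-to-image generation in Python, assuming the google-cloud-aiplatform package and an authenticated Google Cloud project; the project ID, region, and model version string are placeholders to verify against the current Vertex AI docs:

```python
import vertexai
from vertexai.preview.vision_models import ImageGenerationModel

# Placeholder project and region -- substitute your own.
vertexai.init(project="my-gcp-project", location="us-central1")

# Model version string is illustrative; check the docs for the current one.
model = ImageGenerationModel.from_pretrained("imagegeneration@006")

images = model.generate_images(
    prompt="A minimalist logo for a coffee shop called 'Daily Grind'",
    number_of_images=1,
)
images[0].save(location="logo.png")
```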
Following its initial preview, image editing with Imagen 2 is now generally available in Vertex AI, along with two new capabilities: inpainting and outpainting. These features, long staples of other image generators such as DALL-E, let users remove unwanted elements from an image, add new ones, and extend an image's borders to create a wider field of view.
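Editing runs through the same model class in the SDK. A hedged sketch of mask-based inpainting, building on the setup above; the file names are illustrative, and the exact editing parameters may vary by model version:

```python
from vertexai.preview.vision_models import Image, ImageGenerationModel

model = ImageGenerationModel.from_pretrained("imagegeneration@006")

# Base image plus a mask whose marked pixels indicate the region to repaint.
base = Image.load_from_file("product_photo.png")
mask = Image.load_from_file("mask.png")

edited = model.edit_image(
    prompt="Replace the masked area with a plain white backdrop",
    base_image=base,
    mask=mask,
)
edited[0].save(location="product_photo_edited.png")
```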
The standout feature of this upgrade is what Google refers to as “text-to-live images.” Imagen 2 can generate brief, four-second videos directly from text prompts, akin to AI-driven video generation tools like Runway and Pika. Targeting marketers and creatives, Google envisions live images as a GIF generator for visually appealing subjects such as nature, food, and animals—areas where Imagen 2 has been fine-tuned for optimal performance.
Google asserts that these live images can capture “a range of camera angles and motions” while ensuring “consistency throughout the entire sequence,” though current outputs are limited to a resolution of 360 by 640 pixels, a ceiling Google says will rise in the future.
To address deepfake concerns, Google says it will apply SynthID, a technology developed by Google DeepMind, to embed invisible watermarks in live images. Detecting these watermarks, however, requires a proprietary Google tool that is not available to third parties.
Additionally, aiming to prevent further controversies regarding generative media, Google emphasizes that live image outputs will be “filtered for safety.” A representative stated, “The Imagen 2 model in Vertex AI has not encountered the same issues as the Gemini app. We continue extensive testing and collaboration with our clients.”
Assurances aside, the question remains: are live images competitive with existing video generation tools? Not really. Runway can produce 18-second clips at much higher resolutions, Stability AI's Stable Video Diffusion offers greater flexibility in frame rate, and OpenAI's forthcoming Sora, though not yet commercially available, promises a level of photorealism that could outshine them all.
So what are live images' distinct technical advantages? It's not clear. Google has, after all, developed impressive video technology such as Imagen Video and Phenaki; Phenaki, for instance, can turn detailed prompts into two-minute “movies,” albeit at low resolution and with limited coherence.
Given recent reporting that suggests the generative AI boom caught Google's leadership off guard, it's not surprising that a product like live images feels underwhelming. But it also raises the question of whether more capable video technology is sitting unreleased somewhere in Google's research division.
Generative models like Imagen rely on a vast array of examples typically sourced from public websites and data sets. Many AI vendors regard this training data as a crucial competitive edge and are hesitant to disclose it, as it also poses potential intellectual property risks.
Asked what data was used to train Imagen 2, and whether creators could opt out if their work was swept up in the process, Google said only that its models are trained “primarily” on publicly available web data, drawn from “blog posts, media transcripts, and public forums.” Which blogs, transcripts, and forums, specifically, remains unspecified.
A representative pointed to Google's web publisher controls, which let site owners block the company from scraping their data, but Google would not commit to an opt-out mechanism or to compensating creators for their unwitting contributions, steps that competitors such as OpenAI and Adobe have taken.
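For reference, the publisher control the representative is presumably pointing to is the Google-Extended robots.txt token, introduced in late 2023, which tells Google not to use a site's content to train Gemini and the generative models behind Vertex AI. Opting an entire site out is a two-line addition to robots.txt:

```
# Block Google's generative AI training crawler token site-wide
User-agent: Google-Extended
Disallow: /
```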
Note, too, that text-to-live images is not covered by Google's generative AI indemnification policy, which protects Vertex AI customers from copyright claims arising from Google's use of training data and from the outputs of its generative AI models. That's because text-to-live images is still in preview; the policy applies only to products in general availability (GA).
Regurgitation, where a generative model reproduces a near-exact copy of an example it was trained on, remains a legitimate concern for corporate users. Both informal and academic studies have shown that the first-generation Imagen was not immune, spitting out identifiable photos of people and copyrighted artworks when prompted in certain ways.
Barring any significant controversies or unforeseen challenges, text-to-live images will eventually transition to general availability. However, given its current state, Google is essentially advising users to proceed with caution.