Just yesterday, I pondered whether Google would launch an AI product successfully on its first attempt. With the unveiling of VideoPoet, it seems we have our answer.
This week, Google introduced VideoPoet, a groundbreaking large language model (LLM) created by a team of 31 researchers at Google Research and designed to handle a wide variety of video generation tasks.
How this LLM was built is particularly noteworthy. According to the team’s preprint research paper, “Most existing models use diffusion-based methods, widely regarded as the leading performers in video generation. Typically, these models begin with a pretrained image model, such as Stable Diffusion, to create high-fidelity images for individual frames and further fine-tune to enhance temporal consistency across frames.”
In contrast, Google’s research team opted for an LLM built on the transformer architecture that underpins text and code generators such as ChatGPT, Claude 2, and Llama 2, but trained it specifically to generate video.
The Importance of Pre-training
The success of VideoPoet stems from extensive pre-training on 270 million videos and over 1 billion text-image pairs sourced from the public internet and beyond. This data was transformed into text embeddings, visual tokens, and audio tokens that the model could utilize.
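To give a sense of what that means in practice, here is a minimal sketch (not VideoPoet’s actual code) of the general recipe: compress each modality into discrete tokens, concatenate them into a single sequence, and train a decoder-only transformer to predict the next token. All names, vocabulary sizes, and dimensions below are illustrative assumptions.

```python
# Conceptual sketch: text, video, and audio sharing one autoregressive transformer
# once everything is represented as discrete tokens. Vocabulary sizes and model
# dimensions are illustrative assumptions, not VideoPoet's real configuration.
import torch
import torch.nn as nn

TEXT_VOCAB, VIDEO_VOCAB, AUDIO_VOCAB = 32_000, 8_192, 4_096
VOCAB = TEXT_VOCAB + VIDEO_VOCAB + AUDIO_VOCAB  # one shared token space

class TinyMultimodalLM(nn.Module):
    def __init__(self, d_model=512, n_layers=6, n_heads=8, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        # A causal mask turns this encoder stack into a decoder-only model.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq_len) integer IDs
        B, T = tokens.shape
        x = self.tok(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.blocks(x, mask=mask)   # causal self-attention only
        return self.head(x)             # next-token logits over the shared vocab

# Training objective: predict token t+1 from tokens up to t, exactly as in a
# text LLM, except the sequence interleaves text, visual, and audio tokens.
model = TinyMultimodalLM()
seq = torch.randint(0, VOCAB, (2, 128))  # stand-in for a tokenized training example
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
```

The appeal of this setup is that the same next-token objective used for text LLMs covers every modality at once, which is what allows a single model to handle the range of tasks described below.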
The results are impressive, especially when compared with advanced consumer-facing video generation tools such as Runway, which Google has invested in, and Pika.
Longer, Higher Quality Clips with Improved Motion
Google Research claims that their LLM-based approach enables the creation of longer, high-quality clips, addressing current limitations faced by diffusion-based video generation AIs, which often struggle to maintain coherent motion over extended sequences.
As team members Dan Kondratyuk and David Ross noted in a Google Research blog post, “One of the current bottlenecks in video generation is the ability to produce coherent large motions. Many leading models either generate small movements or produce noticeable artifacts when attempting larger motions.”
VideoPoet, however, can deliver larger and more consistent motion across videos of up to 16 frames. It also offers a wide range of capabilities out of the box, such as simulating various camera movements and visual styles, and even generating new audio to complement the visual content. Importantly, it accepts text, images, and existing video as prompts.
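The Google Research blog also notes that longer videos can be produced by repeatedly conditioning on the most recent second or so of generated footage to predict what comes next. The loop below is a rough illustration of that chaining idea; the sampler and chunk sizes are hypothetical stand-ins, not VideoPoet’s real interface.

```python
# Rough sketch of chained clip extension: generate a short clip, then repeatedly
# condition on the most recent chunk of tokens to predict the next chunk, so
# motion and appearance stay consistent across a longer video.
import random

TOKENS_PER_CHUNK = 256   # illustrative: one "second" of visual tokens
CONTEXT_CHUNKS = 1       # condition on the most recent chunk only

def generate_tokens(prefix, n_new):
    """Stand-in sampler: a real model would autoregressively sample new tokens."""
    rng = random.Random(len(prefix))
    return [rng.randrange(8_192) for _ in range(n_new)]

def extend_video(prompt_tokens, n_chunks):
    video = generate_tokens(prompt_tokens, TOKENS_PER_CHUNK)       # first clip
    for _ in range(n_chunks - 1):
        context = video[-CONTEXT_CHUNKS * TOKENS_PER_CHUNK:]       # last chunk
        video += generate_tokens(prompt_tokens + context, TOKENS_PER_CHUNK)
    return video  # a video tokenizer would decode these tokens back to frames

print(len(extend_video(prompt_tokens=[1, 2, 3], n_chunks=5)))      # 5 * 256 tokens
```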
By consolidating these video generation features into a single LLM, VideoPoet eliminates the need for multiple specialized tools, providing a cohesive, all-in-one solution for video creation.
In fact, in a human evaluation conducted by the Google Research team, raters shown VideoPoet clips side-by-side with output from competing models such as Source-1, VideoCrafter, and Phenaki consistently preferred the VideoPoet videos.
According to the Google Research blog, “On average, raters selected 24–35% of VideoPoet examples as better aligned with prompts than competing models, compared to just 8–11% for others. Additionally, 41–54% of VideoPoet examples were rated as having more interesting motion, versus 11–21% for other models.”
Designed for Vertical Video
Google Research has customized VideoPoet to generate portrait-oriented (vertical) videos by default, appealing to the mobile video audience popularized by platforms like Snapchat and TikTok.
Looking to the future, Google Research aims to broaden VideoPoet’s functionality to support “any-to-any” generation tasks, including text-to-audio and audio-to-video, further advancing the potential of video and audio generation.
Currently, VideoPoet is not available for public use, and we are awaiting information from Google regarding its release. Until then, we look forward to seeing how it measures up against the other tools on the market.