Since the launch of Kuaishou's video generation model, China's video generation landscape has grown increasingly competitive, reminiscent of the 2023 surge in text-based large models. The latest entrant is "Qingying," a video generation model officially launched by Zhipu AI. With nothing more than a creative idea, whether a few words or a detailed description, and about 30 seconds of patience, users can generate high-definition (1440×960) videos.
Starting today, Qingying is available in the Qingyan App, joining its existing features for dialogue, images, video, coding, and agent generation. Beyond the web and app versions, users can also try the "AI Dynamic Photo Mini Program," which applies dynamic effects to photos stored on their devices. Generated videos run up to 6 seconds and are free for all users.
As the technology matures, Zhipu AI envisions integrating Qingying's capabilities into short-video production, advertising, and even film editing. Like other generative video models, its research and development follows scaling laws, improving along both the algorithmic and data dimensions. "We are exploring more efficient scaling methods at the model level," Zhipu AI CEO Zhang Peng explained during the Zhipu Open Day event.
Qingying handles a wide range of video styles and genres, including landscapes, animals, science fiction, and cultural history, and it produces convincing results in cartoon, realistic photography, and anime styles. By subject, generation quality ranks roughly in descending order: animals, plants, objects, architecture, and human figures.
For text-to-video, users can enter prompts such as "a low-angle shot of a dragon suddenly appearing on an iceberg" to produce Hollywood-style footage. For image-to-video, a still image paired with a prompt such as "a standing capybara holding ice cream" comes to life, blending digital elements into real-world imagery. The "Old Photos Come Alive" feature likewise lets users animate nostalgic images with minimal effort.
Zhipu AI's work does not stop there. Through continuous research since 2021, the company has developed a series of multimodal generative AI models, culminating in "CogVideoX," the model underpinning Qingying. CogVideoX integrates the text, temporal, and spatial dimensions, and its redesigned architecture runs inference six times faster than its predecessor, CogVideo.
While advances such as OpenAI's Sora have markedly improved coherence in generated video, most models still struggle to maintain logical consistency. Zhipu AI attacks the cost side with an efficient 3D Variational Autoencoder (3D VAE) that compresses raw video into a compact latent space, dramatically lowering training cost and complexity. On the data side, an end-to-end video understanding model generates detailed captions for large volumes of video, raising the quality of the training corpus.
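To make the compression idea concrete, here is a minimal sketch of a video 3D VAE in PyTorch. It is an illustration under assumed hyperparameters (the channel widths, strides, and latent size are all invented here), not the actual CogVideoX implementation; the point is that 3D convolutions downsample time and space together, so a clip is stored as a far smaller latent tensor.

```python
# Minimal sketch of a 3D VAE for video compression (illustrative only;
# layer counts, channel widths, and strides are assumptions, not the
# actual CogVideoX architecture).
import torch
import torch.nn as nn

class Video3DVAE(nn.Module):
    def __init__(self, in_ch=3, latent_ch=16):
        super().__init__()
        # Encoder: 3D convolutions downsample time (T) and space (H, W)
        # jointly, compressing temporal redundancy along with spatial.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, 64, 3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, 3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(128, 2 * latent_ch, 3, stride=(2, 2, 2), padding=1),
        )
        # Decoder mirrors the encoder with transposed convolutions.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_ch, 128, 4, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(128, 64, 4, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, in_ch, (3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):  # video: (B, C, T, H, W)
        mean, logvar = self.encoder(video).chunk(2, dim=1)
        # Reparameterization trick: sample the latent differentiably.
        z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()
        return self.decoder(z), mean, logvar

x = torch.randn(1, 3, 16, 256, 256)   # a 16-frame RGB clip
recon, mean, logvar = Video3DVAE()(x)
print(mean.shape)  # (1, 16, 4, 32, 32): ~2% as many values as the clip
```

Even in this toy configuration the latent tensor holds roughly 2% as many values as the raw clip, the kind of reduction that makes large-scale video training tractable.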
Finally, Zhipu AI developed a Transformer architecture that fuses text and video inputs within a single model, letting the two modalities interact seamlessly. Together with the 3D VAE, this redesign delivers the sixfold inference speedup over CogVideo noted above, making video creation more accessible than ever.
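A minimal sketch of what such fusion can look like, assuming the common approach of concatenating text tokens and flattened video patch tokens into one sequence so plain self-attention lets each modality attend to the other. All dimensions and names below are illustrative assumptions, not the published CogVideoX configuration.

```python
# Sketch of joint text-video attention: text tokens and video patch
# tokens share one sequence, so self-attention conditions each modality
# on the other. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

d_model = 512

class JointBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]        # every token attends to every token
        return x + self.mlp(self.norm2(x))

text = torch.randn(1, 77, d_model)             # encoded prompt tokens
video = torch.randn(1, 4 * 16 * 16, d_model)   # flattened latent patches (T'·H'·W')
tokens = torch.cat([text, video], dim=1)       # one joint sequence
out = JointBlock()(tokens)
video_out = out[:, 77:]                        # video tokens, now text-conditioned
print(video_out.shape)
```

Because the video tokens attend directly to the prompt tokens at every layer, the text can steer generation without a separate cross-attention pathway.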
With the launch of Qingying, Zhipu AI enters the competitive video generation arena. Alongside the consumer-facing app, an API is now available on the bigmodel.cn open platform, letting businesses and developers build text-to-video and image-to-video capabilities into their own products.
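For developers, a call might look like the following sketch. The endpoint path, model identifier, parameter names, and response shape are assumptions for illustration; consult the bigmodel.cn documentation for the actual interface.

```python
# Hypothetical sketch of a text-to-video request to the bigmodel.cn
# platform. The URL path, model name, and fields below are assumptions,
# not the documented API; check the official docs before use.
import os
import requests

API_KEY = os.environ["ZHIPUAI_API_KEY"]  # assumed environment variable name

resp = requests.post(
    "https://open.bigmodel.cn/api/paas/v4/videos/generations",  # assumed path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "cogvideox",  # assumed model identifier
        "prompt": "A low-angle shot of a dragon suddenly appearing on an iceberg",
    },
    timeout=30,
)
resp.raise_for_status()
task = resp.json()
print(task)  # generation is typically asynchronous: poll later with the task id
```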
As more companies roll out AI-driven video features, the generative AI landscape keeps heating up and users gain more options. Whether complete novices or seasoned content creators, everyone can now use large models to explore video creation.