AI companies are in a fierce competition to advance video generation technology. In recent months, key players like Stability AI and Pika Labs have launched models that create videos from text and image prompts. Building on these advancements, Microsoft has introduced a new model called DragNUWA, designed to provide greater control in video production.
DragNUWA enhances traditional text and image input methods by incorporating trajectory-based generation, allowing users to manipulate objects or entire video frames along specific paths. This innovation facilitates precise control over semantic, spatial, and temporal aspects of video creation while ensuring high-quality outputs.
Microsoft has open-sourced the model's weights and demo, inviting the community to experiment. However, it’s essential to recognize that this remains a research initiative and is not yet fully refined.
What Makes Microsoft DragNUWA Unique?
AI-driven video generation has typically relied on text, image, or trajectory inputs in isolation, and each method on its own struggles to offer fine-grained control. For instance, text and images alone can miss the nuanced motion details crucial for video, and language by itself may be ambiguous when describing abstract concepts.
In August 2023, Microsoft’s AI team introduced DragNUWA, an open-domain diffusion-based video generation model that integrates images, text, and trajectory inputs to enable precise video control. Users can define specific text, images, and trajectories to manage various elements, such as camera movements and object motion in the resulting video.
For example, users can upload an image of a boat on water, pair it with the text prompt “a boat sailing in the lake,” and draw a path for the boat's movement. This input generates a video of the boat navigating as specified, with the trajectory pinning down the motion details, the text describing objects that may appear later, and the image distinguishing between subjects.
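To make the boat example concrete, here is a minimal Python sketch of how those three conditioning signals might be bundled before generation. It is purely illustrative, not DragNUWA's actual interface: the `TrajectoryCondition` class, its `resample` helper, and the file name `boat.png` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryCondition:
    """One dragged path: the pixel coordinates an object should follow.

    Hypothetical structure for illustration; DragNUWA's real input
    format may differ.
    """
    points: list[tuple[float, float]]  # (x, y) waypoints in image space

    def resample(self, num_frames: int) -> list[tuple[float, float]]:
        """Linearly interpolate the waypoints to one position per frame."""
        if num_frames < 2 or len(self.points) < 2:
            return list(self.points)
        resampled = []
        for i in range(num_frames):
            t = i * (len(self.points) - 1) / (num_frames - 1)
            lo = int(t)
            hi = min(lo + 1, len(self.points) - 1)
            frac = t - lo
            x = self.points[lo][0] * (1 - frac) + self.points[hi][0] * frac
            y = self.points[lo][1] * (1 - frac) + self.points[hi][1] * frac
            resampled.append((x, y))
        return resampled

# The boat example from the article: an image, a prompt, and a drag path.
image_path = "boat.png"                   # first-frame image (placeholder)
prompt = "a boat sailing in the lake"     # text condition
drag = TrajectoryCondition(points=[(120, 300), (300, 280), (480, 310)])

# One target position per generated frame, e.g. for a 14-frame clip.
per_frame = drag.resample(num_frames=14)
print(per_frame[:3])
```

The point of the resampling step is that a user draws only a handful of waypoints, while the model needs a motion target for every frame it generates.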
DragNUWA in Action
An early version of the model, DragNUWA 1.5, has just been released on Hugging Face; it leverages Stability AI’s Stable Video Diffusion model to animate images along user-defined paths. As this technology evolves, it promises to simplify video generation and editing. Picture transforming backgrounds, animating images, and directing motion with a simple line.
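Because version 1.5 builds on Stable Video Diffusion, the underlying image-to-video base can already be tried through Hugging Face's diffusers library. The sketch below animates a single image with the stock Stable Video Diffusion pipeline; it does not include DragNUWA's trajectory conditioning, and the input and output file names are placeholders.

```python
# Minimal sketch of the Stable Video Diffusion base that DragNUWA 1.5
# builds on, via Hugging Face's diffusers library. Note: no trajectory
# control here; this is plain image-to-video animation.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # fits smaller GPUs at some speed cost

# "boat.png" is a placeholder; SVD works best around 1024x576 input.
image = load_image("boat.png").resize((1024, 576))

frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "boat.mp4", fps=7)
```

Per the article's description, DragNUWA's released demo layers the drag-trajectory control on top of this image-animation base.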
AI enthusiasts are thrilled with this progress, viewing it as a significant step in creative AI. Nonetheless, how the model performs in real-world use remains to be fully assessed. Early tests indicate that DragNUWA can accurately execute camera movements and object motions along various drag trajectories.
“DragNUWA supports complex curved trajectories, enabling objects to move along intricate paths. It also accommodates variable trajectory lengths, allowing for greater motion amplitudes. Additionally, DragNUWA can control the trajectories of multiple objects simultaneously. To our knowledge, no other video generation model has achieved such trajectory control, underscoring DragNUWA’s potential to advance video generation technology,” Microsoft researchers stated in their paper.
This work contributes to the ever-expanding field of AI video research. Recently, Pika Labs gained attention for its ChatGPT-style text-to-video interface, which generates high-quality short videos with various customization options.