As if still-image deepfakes weren’t alarming enough, we may soon have to confront a new threat: generated video that can make anyone who has ever shared a photo online appear to do things they never did. With Animate Anyone, malicious actors can control people’s images like never before.
This innovative generative video technique was created by researchers at Alibaba Group’s Institute for Intelligent Computing. It represents a significant advancement over earlier image-to-video systems, such as DisCo and DreamPose, which, while once impressive, now seem outdated.
Animate Anyone isn’t entirely groundbreaking, but it has successfully made the difficult jump from "janky academic experiment" to "good enough to fool the casual observer." Generated still images and text are already dominating this space, fueling confusion over what’s real and what isn’t.
Image-to-video models work by extracting intricate details—such as facial features, patterns, and poses—from a source image, like a fashion photo showcasing a model in a dress. The system then generates a series of images where these details are mapped onto slightly altered poses, which can be motion-captured from other videos.
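At a high level, that pipeline — encode the reference image’s appearance, extract a pose from each frame of a driving video, then synthesize output frames conditioned on both — can be illustrated with a deliberately simplified sketch. This is a toy stand-in in plain NumPy, not the actual model; the function names (`extract_appearance`, `extract_pose`, `render_frame`) are hypothetical placeholders for deep networks:

```python
import numpy as np

def extract_appearance(reference_img: np.ndarray) -> np.ndarray:
    # Stand-in for an appearance encoder: collapse the image into a
    # per-channel "identity" vector. Real systems use a deep network
    # to capture face, fabric patterns, and so on.
    return reference_img.reshape(-1, reference_img.shape[-1]).mean(axis=0)

def extract_pose(driving_frame: np.ndarray) -> np.ndarray:
    # Stand-in for a pose estimator: a handful of 2D keypoints.
    # Real pipelines extract full skeletons per frame.
    h, w, _ = driving_frame.shape
    return np.array([[w // 2, h // 4],       # head
                     [w // 2, h // 2],       # torso
                     [w // 3, 3 * h // 4],   # left leg
                     [2 * w // 3, 3 * h // 4]])  # right leg

def render_frame(appearance: np.ndarray, pose: np.ndarray,
                 shape: tuple) -> np.ndarray:
    # Stand-in for the generator: paint the appearance vector at each
    # keypoint. The real generator is a video diffusion network.
    frame = np.zeros(shape)
    for x, y in pose:
        frame[int(y), int(x)] = appearance
    return frame

# One reference photo + a driving clip -> an animated sequence.
reference = np.random.rand(64, 64, 3)
driving_clip = [np.random.rand(64, 64, 3) for _ in range(8)]

appearance = extract_appearance(reference)
video = [render_frame(appearance, extract_pose(f), reference.shape)
         for f in driving_clip]
print(len(video), video[0].shape)  # 8 frames, each the size of the reference
```

The point of the sketch is the data flow: appearance is extracted once from the single source photo, while pose is re-extracted for every driving frame, which is why a single shared image is enough to animate someone.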
While earlier models proved the concept viable, they encountered numerous issues. Hallucination was a significant challenge; the model often had to invent realistic movements for elements like sleeves or hair, resulting in odd visuals that made the final video far from convincing. However, Animate Anyone has made marked improvements, though it still has room for growth.
The technical details of the new model will be opaque to most readers, but the paper highlights a crucial intermediate step that “enables the model to comprehensively learn the relationship with the reference image in a consistent feature space, significantly enhancing appearance detail preservation.” By retaining both basic and intricate details, the generated images have a more reliable foundation, leading to much-improved results.
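One common way to realize the idea the paper describes — letting the generator attend to reference-image features that live in the same feature space as its own — is a self-attention layer whose keys and values are extended with the reference features. The toy NumPy version below is an assumption-laden sketch of that general mechanism, not the paper’s exact architecture; in the real model this would happen inside a diffusion network, and the shapes here are purely illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_reference(frame_feats, ref_feats):
    """Self-attention whose keys/values include reference features.

    frame_feats: (n, d) features of the frame being generated
    ref_feats:   (m, d) features of the reference image, embedded
                 in the same d-dimensional feature space
    """
    d = frame_feats.shape[-1]
    # Concatenate along the token axis so every query from the frame
    # can attend to both its own tokens and the reference tokens.
    kv = np.concatenate([frame_feats, ref_feats], axis=0)  # (n+m, d)
    scores = frame_feats @ kv.T / np.sqrt(d)               # (n, n+m)
    return softmax(scores) @ kv                            # (n, d)

rng = np.random.default_rng(0)
frame = rng.standard_normal((16, 32))  # 16 spatial tokens, dim 32
ref = rng.standard_normal((16, 32))    # reference tokens, same dim

out = attention_with_reference(frame, ref)
print(out.shape)  # (16, 32): frame features enriched with reference detail
```

Because both sets of features share one space, the frame being generated can pull fine appearance detail (hair color, clothing patterns) directly from the reference tokens at every step, which is what “appearance detail preservation” is getting at.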
The team showcases the model’s capabilities across various contexts. Fashion models seamlessly adopt arbitrary poses, 2D anime figures come to life and dance, and even Lionel Messi performs a few standard movements.
However, the technology is still not flawless, especially in its depiction of eyes and hands, which remain problematic for generative models. Poses that deviate too far from the original image are also difficult; if a person turns around, for instance, the model struggles to adapt. Nevertheless, this represents a monumental leap from previous models that produced excessive artifacts or simply lost key details, such as hair color or clothing patterns.
It’s unsettling to think that with just one high-quality image, a malicious actor could create a convincing video impersonating you. Coupled with advancements in facial animation and voice capture technology, they could even simulate you saying or expressing anything they desire. While the technology remains too complex and unreliable for widespread application today, the rapid pace of advancements in AI suggests that may not be the case for long.
For now, the developers are holding off on releasing the code. Although they have a GitHub page, they state: “We are actively working on preparing the demo and code for public release. While we cannot commit to a specific release date at this moment, we assure you that our intention to provide access to both the demo and source code is firm.”
Will chaos ensue when the internet is suddenly inundated with these “dancefakes”? We’ll find out, and likely sooner than we expect.