Meta has made significant strides in generative AI with the launch of an upgraded version of Emu (Expressive Media Universe), its image generation foundation model. The model now supports generating videos from text, alongside enhanced capabilities for precise image editing.
Initially showcased at the Meta Connect event in September, Emu's technology underpins many of the generative AI experiences across Meta's social media platforms. For example, it powers image editing tools on Instagram that let users change a photo's visual style or background. Emu is also integrated into Meta AI, the company's new assistant that operates similarly to OpenAI's ChatGPT.
The new Emu Video model can produce videos from natural-language text, from images, or from a combination of both. Unlike previous models such as Make-A-Video, which relied on a cascade of five diffusion models, Emu Video takes a more streamlined approach that uses just two. The process unfolds in two steps: first, it generates an image from the text prompt; then, it generates a video conditioned on both the text prompt and that image. This factorized approach makes training video generation models more efficient. In user studies, Emu Video outperformed Make-A-Video, with 96% of participants preferring its quality and 85% agreeing that it adhered more closely to their text prompts. Emu Video can also bring images uploaded by users to life, animating them according to a text prompt.
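To illustrate that two-step factorization, here is a minimal sketch in Python. It is not Meta's code: the classes and methods below are hypothetical placeholders standing in for the two diffusion models the article describes, showing only how the first stage's image feeds the second stage alongside the text prompt.

```python
# Illustrative sketch (not Meta's implementation) of a factorized
# text-to-video pipeline: text -> image, then (text, image) -> video.
from dataclasses import dataclass
from typing import List

@dataclass
class Image:
    pixels: List[List[int]]  # placeholder for real image data

class TextToImageDiffusion:
    """Step 1: a diffusion model conditioned on the text prompt alone."""
    def generate(self, prompt: str) -> Image:
        # A real model would run a denoising diffusion loop here.
        return Image(pixels=[[0]])

class ImageAndTextToVideoDiffusion:
    """Step 2: a diffusion model conditioned on both the image and the text."""
    def generate(self, prompt: str, first_frame: Image, num_frames: int = 16) -> List[Image]:
        # The generated image anchors the video; the text guides the motion.
        return [first_frame for _ in range(num_frames)]

def factorized_text_to_video(prompt: str) -> List[Image]:
    """Two models instead of five: generate an image, then a video from it."""
    image_model = TextToImageDiffusion()
    video_model = ImageAndTextToVideoDiffusion()
    first_frame = image_model.generate(prompt)        # step 1: text -> image
    return video_model.generate(prompt, first_frame)  # step 2: (text, image) -> video

frames = factorized_text_to_video("a red panda sipping tea")
print(f"Generated {len(frames)} frames")
```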
Another notable update is Emu Edit, which enables image editing through natural-language instructions. Users upload an image and describe the change they want; for instance, they can ask for an element such as a poodle to be removed and replaced with a different object, such as a red bench, simply by typing the request. AI-driven image alteration tools already exist, such as the Stable Diffusion-powered ClipDrop and the image editing features in Runway, but Meta's researchers noted that existing methods tend to either over-modify the image or under-perform on the requested edit.
In a blog post, Meta emphasized that the goal should not only be to produce a "believable" image, but to modify only the pixels relevant to the user's specific request. The team found that incorporating computer vision tasks as instructions to image generation models delivers unparalleled control over the editing process.
To develop Emu Edit, Meta used a dataset of 10 million synthesized samples, each comprising an input image, a description of the task to perform, and the targeted output image. This training data allows the model to follow user instructions closely while preserving the parts of the original image that are unrelated to the edit.
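To make the structure of those training samples concrete, here is a small illustrative sketch of the triplet format the article describes. It does not reflect Meta's released data format; the field names and file paths are hypothetical.

```python
# Illustrative sketch (not Meta's data format): each Emu Edit training sample
# pairs an input image and a task instruction with the targeted output image.
from dataclasses import dataclass

@dataclass
class EditSample:
    input_image_path: str    # the original image
    instruction: str         # natural-language description of the edit
    output_image_path: str   # the image after only the requested change

# A toy example mirroring the poodle-to-red-bench edit from the article.
sample = EditSample(
    input_image_path="park_scene_with_poodle.png",
    instruction="Remove the poodle and replace it with a red bench",
    output_image_path="park_scene_with_red_bench.png",
)
print(sample.instruction)
```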
Those interested in exploring Emu Edit's capabilities can view generated images on Hugging Face. Meta has also released the Emu Edit Test Set, a new benchmark designed to facilitate further evaluation of image editing models. It covers seven image editing tasks, including background alterations and object removal, paving the way for further advances in precise image editing.