Apple has introduced a groundbreaking open-source AI model called “MGIE” (MLLM-Guided Image Editing), designed to edit images based on natural language instructions. Leveraging multimodal large language models (MLLMs), MGIE interprets user commands to execute precise pixel-level modifications. It excels in various editing tasks, including Photoshop-style adjustments, global optimization, and localized edits.
This innovative model is the product of a collaboration between Apple and researchers from the University of California, Santa Barbara, and was presented at the International Conference on Learning Representations (ICLR) 2024, a leading venue for AI research. The accompanying paper shows that MGIE improves results on both automatic metrics and human evaluations while maintaining competitive inference efficiency.
How Does MGIE Work?
MGIE harnesses MLLMs, which can understand both text and visuals, to improve instruction-based image editing. Despite their strong cross-modal understanding, MLLMs have so far been underused in image editing tasks.
MGIE integrates MLLMs into the editing workflow in two primary ways:
1. Deriving Expressive Instructions: MGIE expands terse user prompts into explicit, detailed editing instructions. For instance, inputting “make the sky more blue” could yield the instruction “increase the saturation of the sky region by 20%.” (A control-flow sketch follows this list.)
2. Generating Visual Imagination: The model produces a latent representation of the desired edit that guides the pixel-level adjustments. A novel end-to-end training scheme jointly optimizes instruction derivation, this visual representation, and the editing model.
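To make that two-step workflow concrete, here is a minimal Python sketch of the control flow. Every name and stub body below is an illustrative placeholder, not the actual API of the released MGIE code; in the real system an MLLM performs the instruction rewriting and imagination steps, a diffusion-based editor applies the pixel changes, and all stages are trained end to end.

```python
# Sketch of an MGIE-style two-stage edit pipeline. All functions are
# hypothetical stand-ins for the MLLM and the diffusion editor.

from dataclasses import dataclass

@dataclass
class EditRequest:
    image_path: str  # source image
    prompt: str      # terse user instruction, e.g. "make the sky more blue"

def derive_expressive_instruction(prompt: str) -> str:
    """Stage 1 (MLLM): expand a terse prompt into an explicit instruction.
    A real system would query the multimodal LLM; this stub only
    illustrates the kind of rewriting MGIE performs."""
    rewrites = {
        "make the sky more blue": "increase the saturation of the sky region",
    }
    return rewrites.get(prompt, prompt)

def imagine_visual_guidance(instruction: str) -> list[float]:
    """Stage 2 (MLLM): produce latent 'visual imagination' features that
    condition the editor. Represented here as a dummy embedding."""
    return [float(ord(c)) / 255.0 for c in instruction[:8]]

def apply_edit(image_path: str, guidance: list[float]) -> str:
    """Stage 3 (diffusion editor): apply pixel-level changes conditioned
    on the latent guidance. Stubbed to return an output path."""
    return image_path.replace(".png", "_edited.png")

def edit(request: EditRequest) -> str:
    instruction = derive_expressive_instruction(request.prompt)
    guidance = imagine_visual_guidance(instruction)
    return apply_edit(request.image_path, guidance)

print(edit(EditRequest("photo.png", "make the sky more blue")))
```

The key design point the sketch captures is that the editor never sees the raw user prompt: it is conditioned on the MLLM's expanded instruction and latent guidance, which is what lets vague commands translate into precise edits.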
What Can MGIE Do?
MGIE is versatile, capable of handling a variety of editing scenarios from basic color adjustments to intricate object manipulations. Its features include:
- Expressive Instruction-Based Editing: Produces clear instructions that enhance both the editing quality and user experience.
- Photoshop-Style Modification: Performs common edits such as cropping, resizing, rotating, and advanced adjustments like background replacement and object blending.
- Global Photo Optimization: Enhances overall image quality, adjusting brightness, contrast, sharpness, and applying artistic effects.
- Local Editing: Targets specific areas within an image (e.g., faces, clothing), allowing users to modify attributes like size, color, and texture.
How to Use MGIE?
MGIE is accessible as an open-source project on GitHub, providing users with code, data, and pre-trained models. A demo notebook illustrates various editing tasks, and users can experiment with MGIE through an online demo hosted on Hugging Face Spaces.
Designed for ease of use, MGIE takes natural language commands and returns the edited image together with the derived instruction. Users can give feedback to refine an edit or request alternatives, which also makes the model straightforward to integrate into other applications that need image editing.
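As a rough illustration of how an application might embed this workflow, the sketch below wraps a stand-in `mgie_edit` call in a simple refinement loop. The function name, signature, and return values are assumptions made for illustration, not the entry point exposed by the released code or the Hugging Face demo.

```python
# Hypothetical integration sketch: `mgie_edit` is a stand-in for whatever
# entry point the released MGIE code exposes, not its real API.

def mgie_edit(image_path: str, prompt: str) -> tuple[str, str]:
    """Stand-in model call: returns (edited_image_path, derived_instruction)."""
    instruction = f"expressive form of: {prompt}"  # placeholder rewriting
    return image_path.replace(".png", "_edited.png"), instruction

def refine(image_path: str, prompts: list[str]) -> str:
    """Apply a sequence of instructions, feeding each result into the next,
    mirroring the iterative feedback loop described above."""
    for prompt in prompts:
        image_path, instruction = mgie_edit(image_path, prompt)
        print(f"Applied: {instruction}\nSaved to: {image_path}")
    return image_path

refine("photo.png", ["make the sky more blue", "sharpen the foreground"])
```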
Why is MGIE Important?
MGIE marks a significant advance in instruction-based image editing, an area important to both AI research and human creativity. It shows how MLLMs can be put to work in image editing and opens up new kinds of cross-modal interaction.
Beyond its research significance, MGIE serves as a practical tool for various applications, helping users create and optimize images for personal and professional contexts, including social media, e-commerce, and creative arts. It empowers users to express their ideas visually and encourages creative exploration.
For Apple, MGIE reinforces the company's growing leadership in AI research and development, showcasing its expanding machine learning capabilities with a focus on enhancing everyday creative tasks. While MGIE is a notable achievement, experts acknowledge the ongoing need for advancements in multimodal AI systems. Nonetheless, the rapid progress in this field indicates that assistive AI like MGIE could soon become an essential tool for creativity.