Enhancing Multimodal Large Models: Introducing Groma for Improved Object Localization
Multimodal large language models (MLLMs) show strong cognitive abilities across a range of visual tasks, yet they often struggle to map recognized content back to specific locations in an image. They identify objects well, but their weak localization limits applications in fields such as image editing, autonomous driving, and robotics.
To tackle this limitation, researchers from the University of Hong Kong, in collaboration with ByteDance, have developed Groma, a new paradigm that improves object localization in multimodal models through regional image encoding. With localization built in, Groma can tie textual content to specific image regions, making dialogues more interactive and more closely grounded in the image.
The core difficulty for MLLMs is linking textual information to the corresponding visual areas. Current methods, such as fine-tuning a large language model to generate object coordinates, have several drawbacks:
1. Pre-trained language models have little spatial understanding, so they are hard to teach precise localization from limited data.
2. Localization tasks demand high image resolutions, which significantly increase computational load.
3. The output format of language models is often inadequate for complex localization tasks like segmentation.
Groma addresses these issues by shifting the localization responsibility to the vision tokenizer of the multimodal model. The vision tokenizer detects and locates potential objects, while the large language model identifies them, effectively utilizing the tokenizer's spatial understanding and reducing reliance on external expert models.
Groma builds its localization capability on regional encoding. A Region Proposer first identifies potential objects, and a Region Encoder then converts those areas into region tokens. The language model uses these tokens to tie its output to the corresponding regions, producing a hyperlink-like effect that supports visually grounded conversations. Users can also specify regions themselves; these are encoded into region tokens as well, so the multimodal model can tailor its response to the selected area.
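The propose-then-encode flow described above can be illustrated with a small Python sketch. It is not Groma's implementation: the score-threshold filter, the mean pooling, the `<r0>`-style token names, and every function name here are assumptions chosen only to make the data flow concrete.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (row0, col0, row1, col1), end-exclusive, measured in feature-map cells
Box = Tuple[int, int, int, int]
# Feature map laid out as [rows][cols][channels]
FeatureMap = List[List[List[float]]]


@dataclass
class RegionToken:
    name: str               # hypothetical placeholder the language model would see
    box: Box                # where the region sits on the feature map
    embedding: List[float]  # pooled feature vector for the region


def propose_regions(scored_boxes: List[Tuple[Box, float]], min_score: float) -> List[Box]:
    """Stand-in for the Region Proposer: keep boxes scoring at least min_score."""
    return [box for box, score in scored_boxes if score >= min_score]


def encode_region(features: FeatureMap, box: Box) -> List[float]:
    """Stand-in for the Region Encoder: mean-pool the features inside the box."""
    r0, c0, r1, c1 = box
    cells = [features[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    channels = len(cells[0])
    return [sum(cell[ch] for cell in cells) / len(cells) for ch in range(channels)]


def tokenize_regions(features: FeatureMap,
                     scored_boxes: List[Tuple[Box, float]],
                     min_score: float = 0.5) -> List[RegionToken]:
    """Run the proposer, then encode each surviving box into one region token."""
    boxes = propose_regions(scored_boxes, min_score)
    return [RegionToken(f"<r{i}>", box, encode_region(features, box))
            for i, box in enumerate(boxes)]


# Toy 2x2 feature map with 2 channels per cell.
features = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
]
# Candidate boxes with made-up objectness scores; the last one is filtered out.
candidates = [((0, 0, 1, 2), 0.9), ((1, 0, 2, 2), 0.8), ((0, 0, 2, 2), 0.1)]
tokens = tokenize_regions(features, candidates)
for token in tokens:
    print(token.name, token.box, token.embedding)
```

In this toy version, a user-specified region would simply be passed to `encode_region` directly, skipping the proposer step.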
To improve localization accuracy, Groma trains its Region Proposer on more than 8 million data samples. This training data covers a wide variety of objects, object parts, and broader background regions. Groma's modular design lets the Region Proposer and Region Encoder work on high-resolution feature maps while the large language model receives a lower-resolution input, which reduces computation without sacrificing performance.
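A rough calculation shows why keeping the language model's input at a lower resolution saves computation. The sketch below assumes a patch-based vision encoder that emits one token per square patch; the patch size and image sizes are illustrative values, not figures reported for Groma.

```python
def patch_token_count(image_side: int, patch_side: int) -> int:
    """Number of tokens for a square image split into square, non-overlapping patches."""
    if image_side % patch_side != 0:
        raise ValueError("image_side must be a multiple of patch_side")
    patches_per_side = image_side // patch_side
    return patches_per_side * patches_per_side


# Illustrative sizes only (assumptions, not Groma's settings).
PATCH_SIDE = 14
high_res_tokens = patch_token_count(448, PATCH_SIDE)
low_res_tokens = patch_token_count(224, PATCH_SIDE)

print(high_res_tokens, low_res_tokens, high_res_tokens // low_res_tokens)
```

Doubling the image side quadruples the number of image tokens, so restricting the high-resolution work to the proposer and encoder keeps the sequence passed to the language model short.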
Experimental results indicate that Groma surpasses MiniGPT-v2 and Qwen-VL on traditional grounding benchmarks and shows robust dialogue and reasoning abilities on the LLaVA-COCO VQA benchmark. In visual comparisons, Groma shows higher recall and fewer errors. It also supports referential dialogues and grounded conversations, combining conversational skills with localization.
Large language models excel at cognitive reasoning in visual tasks, whereas traditional tasks such as detection and segmentation rely mainly on perception. Groma decouples the two: the vision tokenizer handles perception, and the large language model handles cognition. This perceive-first, understand-second sequence resembles human visual processing and avoids the computational cost of retraining a large language model for localization.
On May 15, ByteDance introduced its self-developed Doubao model, which provides multimodal capabilities across more than 50 applications, enhancing efficiency and driving intelligent innovation. The Doubao app currently leads the AIGC application landscape in China, reflecting ByteDance's commitment to investing in top talent and advanced technology.