Today, Microsoft’s Azure AI team released a new vision foundation model called Florence-2 on Hugging Face.
This model, available under a permissive MIT license, excels in various vision and vision-language tasks through a unified, prompt-based framework. It offers two sizes—232M and 771M parameters—and demonstrates capabilities in tasks such as captioning, object detection, visual grounding, and segmentation, often outperforming other large vision models.
While the real-world performance of Florence-2 remains to be seen, it aims to give enterprises a single, cohesive approach to a wide range of vision applications, reducing the need to maintain multiple task-specific models that each cover only a narrow slice of functionality and require extensive fine-tuning.
What Makes Florence-2 Stand Out?
Currently, large language models (LLMs) are integral to enterprise operations, handling tasks such as summarization, marketing copy creation, and customer support. Their adaptability across domains has been remarkable, which raises a question for researchers: can vision models, typically built for narrow, task-specific purposes, achieve similar versatility?
Vision tasks are inherently more complex than text-based natural language processing (NLP) because they demand sophisticated perceptual capabilities. A universal model must handle spatial information at different scales, from image-level concepts such as object location down to fine pixel-level detail, as well as semantic detail ranging from high-level captions to fine-grained descriptions.
Microsoft identified two main challenges in creating a unified vision model: the lack of extensively annotated visual datasets and the need for a singular pretraining framework that can integrate spatial hierarchy and semantic granularity.
To overcome these obstacles, Microsoft built a visual dataset named FLD-5B, comprising 5.4 billion annotations across 126 million images, ranging from general image descriptions to annotations of specific object regions. This dataset was used to train Florence-2, which uses a sequence-to-sequence architecture pairing an image encoder with a multi-modality encoder-decoder. This design allows Florence-2 to handle a variety of vision tasks without task-specific architectural changes.
“All annotations in the FLD-5B dataset are standardized into textual outputs, enabling a unified multi-task learning approach with consistent optimization through a uniform loss function,” the researchers noted in their paper. “The result is a versatile vision foundation model capable of handling multiple tasks within a single framework and governed by a consistent set of parameters. Task activation is accomplished through textual prompts, similar to large language models.”
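In practice, this prompt-based task activation means the same checkpoint can be steered toward captioning, detection, or grounding simply by changing the input text. The snippet below is a minimal sketch of how that looks through the Hugging Face transformers library; the checkpoint name, the `<CAPTION>` and `<OD>` task-prompt strings, and the processor's `post_process_generation` helper follow the published model card and should be treated as assumptions rather than a definitive API.

```python
# Minimal sketch of Florence-2's prompt-based task activation via Hugging Face
# transformers. Checkpoint name, task-prompt strings, and post_process_generation
# are taken from the model card and may differ in your version; treat them as
# assumptions, not a definitive implementation.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # assumed checkpoint name; a -large variant is also published
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Placeholder image URL for illustration only.
image = Image.open(requests.get("https://example.com/street.jpg", stream=True).raw)

def run_task(task_prompt: str):
    """Run one vision task; switching tasks only changes the text prompt."""
    inputs = processor(text=task_prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # The processor converts the raw text output back into structured results
    # (a caption string, or boxes and labels), per the model card.
    return processor.post_process_generation(raw_text, task=task_prompt, image_size=image.size)

print(run_task("<CAPTION>"))  # image captioning
print(run_task("<OD>"))       # object detection: boxes and labels decoded from the text output
```

If the released checkpoints behave as the model card describes, the only difference between captioning and detection here is the prompt string, which is exactly the unified, prompt-driven behavior the researchers describe above.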
Performance Exceeding Larger Models
Florence-2 effectively executes a range of tasks—such as object detection, captioning, visual grounding, and visual question answering—when provided with image and text inputs. Notably, it achieves results comparable to or better than many larger models.
For instance, in zero-shot captioning tests on the COCO dataset, both the 232M and 771M versions of Florence-2 surpassed DeepMind's 80B-parameter Flamingo model, posting CIDEr scores of 133 and 135.6, respectively. They also outperformed Kosmos-2, Microsoft's own model specialized for visual grounding, on grounding benchmarks.
When fine-tuned with publicly annotated data, Florence-2 competes closely with larger specialist models in tasks such as visual question answering.
“The pre-trained Florence-2 backbone enhances performance on downstream tasks, like COCO object detection and instance segmentation, and ADE20K semantic segmentation, exceeding both supervised and self-supervised models,” the researchers stated. “Compared to pre-trained models on ImageNet, ours enhances training efficiency by 4X and significantly improves performance by 6.9, 5.5, and 5.9 points on the COCO and ADE20K datasets.”
Currently, both pre-trained and fine-tuned versions of Florence-2 (232M and 771M) are available on Hugging Face under the MIT license, allowing for unrestricted commercial and private usage.
It will be fascinating to see how developers use Florence-2 to avoid maintaining separate vision models for different tasks. Compact, task-agnostic models like these could streamline development and significantly cut computing costs.