Salesforce, the enterprise software leader, has introduced xGen-MM (also known as BLIP-3), a new suite of open-source large multimodal models (LMMs). The release could accelerate the development of sophisticated artificial intelligence systems.
The xGen-MM framework, detailed in a paper published on arXiv by researchers at Salesforce AI Research, comprises pre-trained models, curated large-scale datasets, and fine-tuning code. The largest model, with 4 billion parameters, performs competitively with similar open-source models across a range of benchmarks.
The authors state, “We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research.” This initiative signifies a shift away from the trend of proprietary models, potentially democratizing access to cutting-edge multimodal AI technology.
A schematic diagram of the xGen-MM (BLIP-3) framework illustrates its processing of interleaved image and text data. The model employs a Vision Transformer to encode images, a token sampler to condense visual information, and a pre-trained large language model to generate text, with relevant losses applied to text tokens.
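The flow described above can be sketched as a toy pipeline. All function names, dimensions, and the pooling strategy below are illustrative assumptions for exposition, not Salesforce's actual implementation:

```python
# Toy sketch of an xGen-MM-style pipeline: ViT encodes image patches,
# a token sampler condenses them, and the loss applies only to text tokens.
# Every function here is an illustrative stand-in, not the real code.

def vit_encode(image):
    """Stand-in for a Vision Transformer: one 'embedding' per image patch."""
    return [[float(p)] * 4 for p in image]          # list of patch embeddings

def token_sampler(patch_embeddings, num_tokens=2):
    """Stand-in for the token sampler: condense many patch embeddings
    into a fixed, smaller number of visual tokens (here: mean pooling)."""
    chunk = max(1, len(patch_embeddings) // num_tokens)
    tokens = []
    for i in range(0, len(patch_embeddings), chunk):
        group = patch_embeddings[i:i + chunk]
        tokens.append([sum(col) / len(group) for col in zip(*group)])
    return tokens[:num_tokens]

def build_sequence(image, text_tokens):
    """Concatenate condensed visual tokens with text tokens, and mark
    which positions the language-modelling loss applies to (text only)."""
    visual = token_sampler(vit_encode(image))
    sequence = [("image", v) for v in visual] + [("text", t) for t in text_tokens]
    loss_mask = [kind == "text" for kind, _ in sequence]
    return sequence, loss_mask

seq, mask = build_sequence(image=[1, 2, 3, 4, 5, 6, 7, 8],
                           text_tokens=["a", "cat"])
print(len(seq), mask)  # 4 [False, False, True, True]
```

The key design point the paper's diagram conveys is the last line: visual positions are inputs only, while the generation loss is computed on text positions.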
Key to xGen-MM’s innovation is its capability to manage “interleaved data” that combines multiple images and text, seen by researchers as “the most natural form of multimodal data.” This capability allows the models to perform complex tasks, such as answering questions about several images at once, making them valuable across diverse fields like medical diagnostics and autonomous vehicles.
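To make “interleaved data” concrete, here is a minimal sketch of how a multi-image document might be flattened into a single model input. The record format and the `<image>` placeholder token are hypothetical, chosen for illustration rather than taken from xGen-MM's actual data schema:

```python
# Illustrative sketch of an "interleaved" multimodal input: multiple images
# mixed with text in their natural document order. The segment format and
# placeholder token are assumptions, not xGen-MM's real schema.

def to_model_input(segments):
    """Flatten interleaved segments into one token sequence, replacing each
    image with a placeholder that the vision encoder would later fill in."""
    tokens = []
    for kind, value in segments:
        if kind == "image":
            tokens.append("<image>")        # placeholder for visual tokens
        else:
            tokens.extend(value.split())    # naive whitespace tokenization
    return tokens

document = [
    ("text",  "Compare the two X-rays below."),
    ("image", "scan_before.png"),
    ("image", "scan_after.png"),
    ("text",  "Which one shows the fracture?"),
]
print(to_model_input(document))
# ['Compare', 'the', 'two', 'X-rays', 'below.', '<image>', '<image>',
#  'Which', 'one', 'shows', 'the', 'fracture?']
```

Because images and text share one sequence in document order, a question can refer back to any of the preceding images, which is what enables multi-image reasoning tasks like the comparison above.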
The release includes several model variants optimized for specific tasks: a base pre-trained model, an “instruction-tuned” version for adhering to directives, and a “safety-tuned” model aimed at minimizing harmful outputs. This selection reflects growing recognition in the AI community of the need to merge capability with ethical considerations.
Salesforce’s decision to open-source these models promises to significantly enhance innovation in the multimodal AI domain. By granting researchers and developers access to high-quality models and datasets, Salesforce creates opportunities for broader collaboration and advancement, contrasting with the closed strategies of some tech giants.
Nonetheless, the launch of such influential models raises critical questions about potential risks and societal impacts associated with advanced AI systems. While Salesforce has incorporated safety tuning to address these concerns, the wider ramifications of broadly accessible advanced AI models continue to stimulate discussions within the tech community and beyond.
The xGen-MM models were trained on extensive datasets curated by Salesforce, including a trillion-token dataset of interleaved image and text data known as “MINT-1T.” Additionally, new datasets targeting optical character recognition and visual grounding have been developed, which are essential for AI systems interacting naturally with the visual environment.
As AI technology becomes increasingly prevalent, Salesforce’s open-source initiative equips researchers with vital tools to enhance their understanding and development of these powerful systems. This move also establishes a benchmark for transparency in a field often critiqued for its opacity, potentially encouraging other tech companies to adopt similar practices with their AI research.
In an intensifying AI race, Salesforce’s open strategy could serve as a vital differentiator. By promoting a collaborative environment around its models, the company may foster faster innovation and cultivate positive relationships within the research community. However, whether this approach pays off in the competitive realm of enterprise AI solutions remains to be seen.
The code, models, and datasets for xGen-MM are accessible on Salesforce’s GitHub repository, with more resources expected on the project’s website soon. As researchers and developers engage with these models, the true impact of Salesforce’s contributions to multimodal AI will increasingly unfold in the coming months and years.