Open Source: Cambridge Team's First Pre-trained General Multi-modal Late-Interaction Knowledge Retriever

PreFLMR Model: Advanced Multi-modal Knowledge Retriever for RAG Applications

The PreFLMR model is a cutting-edge multi-modal knowledge retriever specifically designed to enhance Retrieval-Augmented Generation (RAG) applications. Based on the Fine-grained Late-interaction Multi-modal Retriever (FLMR) from NeurIPS 2023, it boasts improvements and extensive pre-training on the M2KR dataset.

Project Overview

Despite the impressive capabilities of modern multi-modal large models (like GPT-4 Vision and Gemini), they often struggle with expertise-driven inquiries. For example, even GPT-4 Vision can falter on complex, knowledge-intensive questions. With a retriever such as PreFLMR, these models can access relevant external knowledge and give markedly more accurate responses.

RAG offers a simple yet powerful way to transform multi-modal models into "domain experts." It utilizes a lightweight knowledge retriever to obtain relevant information from specialized databases, such as corporate knowledge bases or Wikipedia. By integrating knowledge with queries, these models generate precise answers. The effectiveness of multi-modal knowledge retrieval is largely dependent on the retriever's capability to recall important information.
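
To make that flow concrete, below is a minimal sketch of such a retrieve-then-generate loop. The `retrieve` and `generate` functions are hypothetical placeholders standing in for a knowledge retriever (such as PreFLMR) and a multi-modal large model; they are not PreFLMR's actual API.

```python
# Minimal sketch of a multi-modal RAG loop. `retrieve` and `generate` are
# hypothetical placeholders, not PreFLMR's real interface.
from typing import List


def retrieve(question: str, image_path: str, k: int = 5) -> List[str]:
    """Placeholder: return the k passages most relevant to the image-question pair."""
    return [f"passage {i} related to the query" for i in range(k)]


def generate(prompt: str, image_path: str) -> str:
    """Placeholder: ask a multi-modal large model to answer the augmented prompt."""
    return "an answer grounded in the retrieved passages"


def answer_with_rag(question: str, image_path: str) -> str:
    # 1. Retrieve knowledge relevant to the image-question pair.
    passages = retrieve(question, image_path, k=5)
    # 2. Prepend the retrieved passages to the original question.
    prompt = "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}\nAnswer:"
    # 3. Generate an answer grounded in the retrieved knowledge.
    return generate(prompt, image_path)


print(answer_with_rag("What year was this landmark built?", "landmark.jpg"))
```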

The Cambridge University Artificial Intelligence Lab has recently open-sourced the PreFLMR model, marking the first pre-trained universal multi-modal late-interaction knowledge retriever. Here are its key features:

1. Multi-modal Retrieval Versatility: PreFLMR excels in various document retrieval tasks, including text-to-text, image-to-text, and knowledge retrieval. Following extensive pre-training on millions of multi-modal retrieval pairs, it demonstrates exceptional performance across downstream retrieval applications. With minimal additional training on proprietary data, it can swiftly adapt to become an effective domain-specific model.

2. Fine-Grained Representation: Unlike traditional Dense Passage Retrieval (DPR) systems that compress everything into a single vector, PreFLMR represents each query and document as a matrix of token-level vectors, covering both text and image tokens (a sketch of this representation follows this list). This preserves the fine-grained information needed to match intricate queries and improves retrieval accuracy in multi-modal contexts.

3. Adaptive Document Extraction: PreFLMR adeptly retrieves documents according to user needs—whether extracting information related to a specific question or details about items depicted in an image. This capability dramatically enhances the performance of multi-modal models in professional knowledge question-answering tasks, allowing them to address queries integrating both images and text.
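
As referenced in point 2, the sketch below illustrates what a matrix-style query representation might look like: text-token embeddings and projected image-token embeddings concatenated into one matrix. The dimensions and module names are illustrative assumptions, not PreFLMR's actual configuration.

```python
# Illustrative sketch of the token-matrix query representation; all sizes and
# module names are assumptions for illustration only.
import torch
import torch.nn as nn

dim = 128               # late-interaction embedding dimension (illustrative)
num_text_tokens = 32    # tokens from the text encoder (illustrative)
num_image_tokens = 49   # e.g. ViT patch tokens (illustrative)

# Stand-ins for the text encoder outputs and the ViT patch features.
text_token_embeddings = torch.randn(num_text_tokens, dim)
vit_patch_features = torch.randn(num_image_tokens, 768)

# A projection layer maps image features into the same token-embedding space.
image_text_projection = nn.Linear(768, dim)
image_token_embeddings = image_text_projection(vit_patch_features)

# The query is represented by ALL token vectors (text + image), not a single
# pooled vector as in DPR, so fine-grained information is preserved.
query_matrix = torch.cat([text_token_embeddings, image_token_embeddings], dim=0)
print(query_matrix.shape)  # (num_text_tokens + num_image_tokens, dim)
```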

The Cambridge team has made available three model variants, with parameter counts ranging from PreFLMR ViT-B (207M) to PreFLMR ViT-G (2B), catering to diverse application requirements. Alongside the model launch, they introduced the Multi-task Multi-modal Knowledge Retrieval benchmark (M2KR), which encompasses ten widely-studied retrieval tasks and features over a million retrieval pairs for training and evaluating universal knowledge retrievers.

M2KR Dataset Overview

The M2KR dataset is structured to support the training and evaluation of multi-modal retrieval models. It unifies ten public datasets into a consistent question-document retrieval format, covering tasks like image captioning and multi-modal dialogue.
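
For illustration, a record in such a unified question-document retrieval format might look roughly like the following. The field names are assumptions made for this sketch; consult the released dataset for the exact schema.

```python
# Hypothetical example of a unified question-document retrieval record; field
# names are illustrative assumptions, not the official M2KR schema.
m2kr_style_example = {
    "question_id": "okvqa_000123",  # source task + local id (illustrative)
    "instruction": "Retrieve a document that answers the question about the image.",
    "question": "What breed is the dog shown in the picture?",
    "image_path": "images/okvqa_000123.jpg",
    "pos_item_ids": ["wiki_passage_98765"],  # relevant (positive) documents
}

knowledge_base_entry = {
    "passage_id": "wiki_passage_98765",
    "passage_content": "The Border Collie is a British breed of herding dog ...",
}
```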

PreFLMR Architecture

PreFLMR encodes user queries at the token level, enabling fine-grained interaction with the document's token matrix: each query token vector is compared against every document token vector via dot products, the highest similarity per query token is kept, and these maxima are summed to produce the relevance score. This preserves the fine-grained information carried by individual tokens.
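
This relevance computation corresponds to the late-interaction (MaxSim) scoring used by ColBERT-style retrievers. A minimal sketch, with illustrative shapes:

```python
# Late-interaction (MaxSim) scoring sketch: each query token finds its
# best-matching document token, and the maxima are summed.
import torch


def late_interaction_score(query_matrix: torch.Tensor,
                           doc_matrix: torch.Tensor) -> torch.Tensor:
    """query_matrix: (num_query_tokens, dim); doc_matrix: (num_doc_tokens, dim)."""
    # Dot products between every query token and every document token.
    similarity = query_matrix @ doc_matrix.T              # (Q, D)
    # For each query token, keep its best-matching document token ...
    max_per_query_token = similarity.max(dim=1).values    # (Q,)
    # ... and sum the maxima to obtain the query-document relevance score.
    return max_per_query_token.sum()


score = late_interaction_score(torch.randn(81, 128), torch.randn(180, 128))
print(float(score))
```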

The model's pre-training consists of four phases:

1. Text Encoder Pre-training: Initial training involves MSMARCO, establishing a late-interaction retrieval model as the text encoder.

2. Image-Text Projection Layer Pre-training: Subsequently, the model undergoes training on M2KR, focusing on the image-text projection layer while keeping other components fixed to avoid over-dependence on text.

3. Continuous Pre-training: This phase involves training on a high-quality knowledge-intensive visual question-answering task from M2KR to bolster knowledge extraction capabilities.

4. General Retrieval Training: The final phase encompasses comprehensive training on the M2KR dataset, refining general retrieval capabilities while selectively unlocking parameters of the query and document encoders (the staged freezing and unfreezing is sketched after this list).
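
The selective freezing and unfreezing used in phases 2 and 4 can be sketched as follows. The toy module and attribute names are assumptions for illustration, not PreFLMR's actual implementation.

```python
# Sketch of staged training via selective parameter freezing; module names are
# illustrative placeholders, not PreFLMR's real attributes.
import torch.nn as nn


class ToyRetriever(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.text_encoder = nn.Linear(dim, dim)          # stand-in for the ColBERT-style text encoder
        self.image_encoder = nn.Linear(dim, dim)         # stand-in for the ViT image encoder
        self.image_text_projection = nn.Linear(dim, dim)  # image-text projection layer


def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable


model = ToyRetriever()

# Phase 2: train only the image-text projection layer; everything else is frozen.
set_trainable(model, False)
set_trainable(model.image_text_projection, True)

# Phase 4: additionally unlock the text (query/document) encoder parameters.
set_trainable(model.text_encoder, True)

print([name for name, p in model.named_parameters() if p.requires_grad])
```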

Experimental Results

PreFLMR has consistently outperformed baseline models across seven M2KR retrieval tasks. Its strongest configuration pairs the ViT-G image encoder with the ColBERT-base-v2 text encoder, totalling roughly two billion parameters. Scaling the image encoder from ViT-B to ViT-L significantly enhances performance, demonstrating the model's capacity to leverage a larger visual encoder.

Additionally, using PreFLMR for retrieval in knowledge-intensive visual question-answering tasks led to remarkable system performance improvements, achieving effectiveness increases of 94% and 275% on the Infoseek and EVQA tasks, respectively.

Conclusion

The PreFLMR model, developed by the Cambridge AI Lab, is the first open-source universal late-interaction multi-modal retriever. With extensive pre-training on the M2KR dataset, it exhibits robust capabilities across numerous retrieval tasks. Model weights, code, and the M2KR dataset are readily available on the project homepage: PreFLMR.
