Enhancing large language models (LLMs) with knowledge extending beyond their training data is crucial for enterprise applications.
A prominent approach to integrating domain-specific and customer knowledge into LLMs is retrieval-augmented generation (RAG). However, basic RAG methods often fall short.
Building effective data-augmented LLM applications requires careful consideration of several factors. A recent study by Microsoft researchers proposes a framework for categorizing RAG tasks by the type of external data they need and the complexity of the reasoning they involve.
“Data-augmented LLM applications are not a one-size-fits-all solution,” the researchers note. “Real-world demands, especially in expert domains, are intricate and can vary significantly in their relationship with the provided data and the reasoning required.”
To navigate this complexity, the researchers suggest a four-level categorization of user queries:
- Explicit Facts: Queries requiring retrieval of directly stated facts from the data.
- Implicit Facts: Queries needing inference of unstated information, often involving basic reasoning.
- Interpretable Rationales: Queries that necessitate understanding and applying explicit domain-specific rules from external resources.
- Hidden Rationales: Queries requiring the uncovering of implicit reasoning methods not stated in the data.
Each query level presents unique challenges and necessitates tailored solutions.
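To illustrate how such a categorization might be operationalized in practice, the sketch below routes an incoming query to one of the four levels with a simple classification prompt. The prompt wording and function names are illustrative assumptions, not part of the researchers' framework; `llm` stands in for whatever chat-completion client you use.

```python
from typing import Callable

# The four query levels from the framework, used as routing labels.
QUERY_LEVELS = {
    "explicit_facts",
    "implicit_facts",
    "interpretable_rationales",
    "hidden_rationales",
}

ROUTER_PROMPT = """Classify the user query into exactly one category:
- explicit_facts: answerable by retrieving directly stated facts
- implicit_facts: requires inferring or combining facts across sources
- interpretable_rationales: requires applying documented domain rules
- hidden_rationales: requires reasoning patterns not stated in the data

Query: {query}
Category:"""


def route_query(query: str, llm: Callable[[str], str]) -> str:
    """Classify a query so it can be sent to the matching pipeline.
    `llm` is any function that takes a prompt and returns the model's text."""
    label = llm(ROUTER_PROMPT.format(query=query)).strip().lower()
    return label if label in QUERY_LEVELS else "explicit_facts"  # safe default
```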
Categories of Data-Augmented LLM Applications
Explicit Fact Queries
These queries focus on straightforward retrieval of factual information explicitly stated in the data. The defining characteristic is a direct dependency on specific pieces of external data.
Basic RAG is commonly employed here, where the LLM retrieves relevant information from a knowledge base to generate a response. However, challenges arise at every stage of the RAG pipeline. For instance, during indexing, the RAG system must manage large, unstructured datasets that may include multi-modal elements like images and tables. Multi-modal document parsing and embedding models can help map the semantic context of textual and non-textual elements into a shared space.
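To make the basic pipeline concrete, here is a minimal retrieval sketch: chunks are embedded at indexing time and the closest ones are returned for a query. The toy hashing embedder is only a stand-in for a real (possibly multi-modal) embedding model, and the function names are illustrative assumptions.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Toy bag-of-words hashing embedder; a real system would use a
    trained (possibly multi-modal) embedding model instead."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def build_index(chunks: list[str]) -> list[tuple[str, list[float]]]:
    """Indexing stage: embed every chunk once, up front."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(query: str, index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Retrieval stage: return the k chunks closest to the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# At the answer-generation stage, the retrieved chunks are placed in the
# LLM's prompt alongside the user question.
```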
At the information retrieval stage, the relevance of retrieved data is critical. Developers can align queries more closely with the document store, for example by generating synthetic answers and retrieving against them to improve accuracy. Additionally, at the answer generation stage, fine-tuning helps the LLM discern relevant information and ignore noise from the knowledge base.
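One way to read "synthetic answers" is the hypothetical-answer pattern: draft an answer first and retrieve with it, since a draft answer often resembles the stored documents more than the raw question does. The sketch below reuses the `retrieve` function from the previous example and an injected `llm` callable; it is an interpretation of the idea, not the researchers' exact method.

```python
from typing import Callable


def retrieve_with_synthetic_answer(
    question: str,
    index: list[tuple[str, list[float]]],
    llm: Callable[[str], str],
    k: int = 3,
) -> list[str]:
    """Draft a hypothetical answer, then use it (plus the question) as the
    retrieval query to better align with the document store."""
    draft = llm(f"Write a short, plausible answer to: {question}")
    return retrieve(question + "\n" + draft, index, k=k)
```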
Implicit Fact Queries
These queries require LLMs to reason beyond mere retrieval. For example, a user might ask, “How many products did company X sell in the last quarter?” or “What are the main differences between the strategies of company X and company Y?” These questions necessitate multi-hop question answering, involving data from multiple sources.
The complexity of implicit fact queries mandates advanced RAG techniques, such as Interleaving Retrieval with Chain-of-Thought (IRCoT) and Retrieval Augmented Thought (RAT). Knowledge graphs combined with LLMs also offer a structured method for complex reasoning, linking disparate concepts effectively.
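A simplified sketch of the interleaving idea behind IRCoT-style methods is shown below: the model alternates between producing a reasoning step and retrieving new evidence conditioned on that step. The control loop and prompt wording are illustrative assumptions, and `retrieve` refers to the earlier sketch.

```python
from typing import Callable


def interleaved_retrieval_cot(
    question: str,
    index: list[tuple[str, list[float]]],
    llm: Callable[[str], str],
    max_hops: int = 3,
) -> str:
    """Alternate between reasoning and retrieval, in the spirit of
    interleaving retrieval with chain-of-thought (IRCoT)."""
    evidence: list[str] = []
    thoughts: list[str] = []
    for _ in range(max_hops):
        prompt = (
            f"Question: {question}\n"
            "Evidence so far:\n" + "\n".join(evidence) + "\n"
            "Reasoning so far:\n" + "\n".join(thoughts) + "\n"
            "Write the next single reasoning step."
        )
        thought = llm(prompt)
        thoughts.append(thought)
        # Use the newest thought to fetch evidence for the next hop.
        evidence.extend(retrieve(thought, index, k=2))
    return llm(
        f"Question: {question}\nEvidence:\n" + "\n".join(evidence)
        + "\nReasoning:\n" + "\n".join(thoughts) + "\nFinal answer:"
    )
```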
Interpretable Rationale Queries
These queries require LLMs to apply domain-specific rules alongside factual content. “Interpretable rationale queries represent a straightforward category relying on external data for rationales,” the researchers explain. This type often involves clear guidelines or thought processes relevant to specific problems.
A customer service chatbot, for instance, may need to combine documented protocols for handling returns with the customer's context. Integrating these rationales into LLMs can be challenging, and may require prompt-tuning techniques such as reinforcement learning and automated prompt optimization.
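For the return-handling example, the most direct way to supply an interpretable rationale is to place the documented protocol in the prompt next to the customer's context and refine that prompt over time. The policy text and template below are only an assumed starting point, not a policy from the study.

```python
RETURNS_PROTOCOL = """\
1. Confirm the purchase date and order number.
2. Returns are accepted within 30 days of delivery.
3. Opened electronics incur a 15% restocking fee.
4. Escalate damaged-on-arrival claims to a human agent."""  # illustrative policy

PROMPT_TEMPLATE = """You are a customer-service assistant.
Follow this documented return policy exactly:
{protocol}

Customer context:
{context}

Customer message:
{message}

Respond according to the policy, citing the rule you applied."""


def build_support_prompt(context: str, message: str) -> str:
    """Assemble a prompt that pairs the documented rationale with the case."""
    return PROMPT_TEMPLATE.format(
        protocol=RETURNS_PROTOCOL, context=context, message=message
    )
```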
Hidden Rationale Queries
These present the most significant challenge, as they involve reasoning methods embedded within the data but not explicitly stated. For instance, the model may need to analyze historical data to extract patterns applicable to a current issue.
“Navigating hidden rationale queries… demands sophisticated analytical techniques to decode and leverage the latent wisdom embedded within disparate data sources,” the researchers observe.
Effective solutions for these queries can involve in-context learning, guiding the LLM with worked examples on how to select and apply relevant information. Domain-specific fine-tuning may also be essential, enabling the model to engage in complex reasoning and discern what external data is necessary.
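A common way to apply in-context learning here is to keep a library of previously solved cases, select those most similar to the current problem, and place them in the prompt as worked examples. The sketch below reuses the toy embedder from the earlier example and is one possible setup under that assumption, not the paper's prescribed method.

```python
def select_examples(problem: str, solved_cases: list[dict], k: int = 2) -> list[dict]:
    """Pick the k solved cases most similar to the current problem.
    Each case is a dict like {"problem": ..., "solution": ...}."""
    p = toy_embed(problem)
    ranked = sorted(
        solved_cases,
        key=lambda c: cosine(p, toy_embed(c["problem"])),
        reverse=True,
    )
    return ranked[:k]


def build_icl_prompt(problem: str, solved_cases: list[dict]) -> str:
    """Place the selected worked examples ahead of the new problem."""
    examples = select_examples(problem, solved_cases)
    demos = "\n\n".join(
        f"Problem: {c['problem']}\nSolution: {c['solution']}" for c in examples
    )
    return f"{demos}\n\nProblem: {problem}\nSolution:"
```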
Implications for Building LLM Applications
The Microsoft Research survey and framework illustrate the evolution of LLMs in utilizing external data for practical applications, while also highlighting the outstanding challenges. Enterprises can leverage this framework to make informed decisions about the integration of external knowledge into their LLMs.
While RAG techniques address many limitations of basic LLMs, developers must remain cognizant of the capabilities and constraints of their chosen methods, upgrading to more sophisticated systems as needed.