Google is broadening its AI model lineup to tackle critical challenges in the field. Today, the company introduced DataGemma, a suite of open-source, instruction-tuned models designed to reduce hallucinations—where large language models (LLMs) generate inaccurate responses—specifically in statistical queries.
Available on Hugging Face for research and academic purposes, these new models expand upon the existing Gemma family, utilizing extensive real-world data from Google's Data Commons platform. This public platform houses an open knowledge graph comprising over 240 billion data points sourced from reputable organizations across various sectors, including economics, science, and health.
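For a sense of what that knowledge graph exposes, the sketch below reads a single statistic and a time series through the public `datacommons` Python client. It is a minimal illustration, not part of DataGemma itself, and the place and variable identifiers (`geoId/06`, `Count_Person`) are examples chosen here for illustration.

```python
# Minimal sketch: reading one statistic from Data Commons with the public
# Python client (pip install datacommons). Depending on the client version,
# an API key may need to be configured first.
import datacommons as dc

# "geoId/06" is the Data Commons ID for California; "Count_Person" is the
# statistical variable for total population (both used here as examples).
latest_population = dc.get_stat_value("geoId/06", "Count_Person")
print(f"Latest recorded population of California: {latest_population}")

# A full time series for the same variable, keyed by observation date.
series = dc.get_stat_series("geoId/06", "Count_Person")
for date, value in sorted(series.items())[-3:]:
    print(date, value)
```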
Addressing Factual Hallucinations
LLMs have revolutionized technology, powering applications from code generation to customer support and optimizing resource use for enterprises. Despite their advancements, the issue of hallucinations—especially related to numerical and statistical data—persists.
According to Google researchers, factors contributing to this phenomenon include the probabilistic nature of LLM outputs and insufficient factual coverage in the training data. Traditional grounding techniques have struggled with statistical queries due to the varied schemas and formats in public data, requiring substantial context for accurate interpretation.
To bridge these gaps, researchers integrated Data Commons, one of the largest repositories of normalized public statistical data, with the Gemma family of language models, creating DataGemma.
Innovative Approaches for Enhanced Accuracy
DataGemma employs two distinct methods to improve factual accuracy:
1. Retrieval Interleaved Generation (RIG): This approach improves factual accuracy by checking the LLM's generated figures against relevant statistics from Data Commons. The fine-tuned LLM produces descriptive natural language queries alongside its output; these are converted into structured data queries that retrieve the relevant statistics, complete with citations (see the RIG sketch after this list).
2. Retrieval-Augmented Generation (RAG): This method retrieves relevant data before generation. The original statistical question is used to extract relevant variables and form natural language queries directed at Data Commons; the retrieved data, combined with the original question, then prompts a long-context LLM (here, Gemini 1.5 Pro) to generate the answer (see the RAG sketch after this list).
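The following sketch illustrates the RIG flow described above under stated assumptions: the helper names, the query-conversion step, and the annotation format are placeholders for illustration, not DataGemma's actual interface. Only the Data Commons lookup uses the public Python client.

```python
# Hedged sketch of Retrieval Interleaved Generation (RIG): the model drafts an
# answer plus a natural-language statistical query, and its number is checked
# against Data Commons. Helper functions are stand-ins so the sketch runs.
import datacommons as dc

def model_draft_with_query(question: str) -> dict:
    """Stand-in for the fine-tuned Gemma model. In DataGemma the model itself
    emits both a draft value and a descriptive query; hard-coded here."""
    return {
        "draft_answer": "California has roughly 39 million residents.",
        "draft_value": 39_000_000,
        "nl_query": "What is the population of California?",
    }

def nl_query_to_structured(nl_query: str) -> tuple[str, str]:
    """Stand-in for converting a natural-language query into a structured
    Data Commons lookup (place DCID, statistical variable)."""
    # Hypothetical mapping; a real system would resolve entities and variables.
    return "geoId/06", "Count_Person"

def rig_answer(question: str) -> str:
    draft = model_draft_with_query(question)
    place, stat_var = nl_query_to_structured(draft["nl_query"])
    grounded_value = dc.get_stat_value(place, stat_var)  # retrieve the statistic
    # Interleave the retrieved figure with the model's own number, with a citation.
    return (
        f"{draft['draft_answer']} "
        f"[Data Commons reports {grounded_value:,} for {stat_var} in {place}; "
        f"model draft was {draft['draft_value']:,}.]"
    )

print(rig_answer("How many people live in California?"))
```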
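And a corresponding sketch of the RAG flow, again under assumptions: the query-extraction helper, the prompt layout, and the example statistical variables are illustrative, while the final prompt would be handed to a long-context model such as Gemini 1.5 Pro.

```python
# Hedged sketch of Retrieval-Augmented Generation (RAG): fetch relevant
# statistics from Data Commons first, then combine them with the original
# question as context for a long-context LLM.
import datacommons as dc

def extract_queries(question: str) -> list[tuple[str, str]]:
    """Stand-in for the step that turns the user's question into Data Commons
    lookups (place DCID, statistical variable). Variables are examples."""
    return [("geoId/06", "Count_Person"), ("geoId/06", "UnemploymentRate_Person")]

def build_prompt(question: str) -> str:
    context_lines = []
    for place, stat_var in extract_queries(question):
        series = dc.get_stat_series(place, stat_var)  # date -> value
        latest_date, latest_value = sorted(series.items())[-1]
        context_lines.append(f"{stat_var} for {place} ({latest_date}): {latest_value}")
    context = "\n".join(context_lines)
    # The retrieved statistics plus the original question form the prompt that
    # would be passed to a long-context LLM for answer generation.
    return (
        "Using only the statistics below, answer the question.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

print(build_prompt("How are population and unemployment trending in California?"))
```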
Promising Results in Testing
In preliminary tests involving 101 queries, the DataGemma models fine-tuned with RIG improved factual accuracy from a baseline of 5-17% to approximately 58%. RAG produced slightly lower accuracy but still outperformed the baseline models.
Using RAG, DataGemma answered 24-29% of queries with statistical responses drawn from Data Commons. When it did, the cited numerical values were 99% accurate, though the model drew incorrect inferences from those numbers 6-20% of the time.
Both RIG and RAG improve model accuracy on statistical queries, which makes them particularly useful in research and decision-making contexts. RIG is faster, while RAG supplies more extensive data but depends on the availability of relevant information in Data Commons and on the model's ability to handle long contexts.
Google aims to advance research on these methods through the public release of DataGemma with RIG and RAG.
The company stated, "Our research is ongoing, and we are committed to refining these methodologies as we scale up this work, ensuring rigorous testing, and integrating this enhanced functionality into both Gemma and Gemini models via a phased, limited-access approach."