Google is broadening its AI model lineup to tackle critical challenges in the field. Today, the company introduced DataGemma, a suite of open-source, instruction-tuned models designed to reduce hallucinations—where large language models (LLMs) generate inaccurate responses—specifically in statistical queries.
Available on Hugging Face for research and academic purposes, these new models expand upon the existing Gemma family, utilizing extensive real-world data from Google's Data Commons platform. This public platform houses an open knowledge graph comprising over 240 billion data points sourced from reputable organizations across various sectors, including economics, science, and health.
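For a sense of what that knowledge graph exposes, the sketch below reads a single statistic and a time series through the public `datacommons` Python client. It is a minimal illustration, not part of DataGemma itself, and the place and variable identifiers (`geoId/06`, `Count_Person`) are examples chosen here for illustration.

```python
# Minimal sketch: reading one statistic from Data Commons with the public
# Python client (pip install datacommons). Depending on the client version,
# an API key may need to be configured first.
import datacommons as dc

# "geoId/06" is the Data Commons ID for California; "Count_Person" is the
# statistical variable for total population (both used here as examples).
latest_population = dc.get_stat_value("geoId/06", "Count_Person")
print(f"Latest recorded population of California: {latest_population}")

# A full time series for the same variable, keyed by observation date.
series = dc.get_stat_series("geoId/06", "Count_Person")
for date, value in sorted(series.items())[-3:]:
    print(date, value)
```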
Addressing Factual Hallucinations
LLMs have revolutionized technology, powering applications from code generation to customer support and optimizing resource use for enterprises. Despite their advancements, the issue of hallucinations—especially related to numerical and statistical data—persists.
According to Google researchers, factors contributing to this phenomenon include the probabilistic nature of LLM outputs and insufficient factual coverage in the training data. Traditional grounding techniques have struggled with statistical queries due to the varied schemas and formats in public data, requiring substantial context for accurate interpretation.
To bridge these gaps, researchers integrated Data Commons, one of the largest repositories of normalized public statistical data, with the Gemma family of language models, creating DataGemma.
Innovative Approaches for Enhanced Accuracy
DataGemma employs two distinct methods to improve factual accuracy:
1. Retrieval Interleaved Generation (RIG): This approach improves factual accuracy by checking the LLM's generated figures against relevant statistics from Data Commons. The fine-tuned LLM produces descriptive natural language queries alongside its output; these are converted into structured data queries that retrieve the relevant statistics, complete with citations (see the RIG sketch after this list).
2. Retrieval-Augmented Generation (RAG): This method retrieves relevant data before generation. The original statistical question is used to extract relevant variables and form natural language queries directed at Data Commons; the retrieved data, combined with the original question, then prompts a long-context LLM (here, Gemini 1.5 Pro) to generate the answer (see the RAG sketch after this list).
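The following sketch illustrates the RIG flow described above under stated assumptions: the helper names, the query-conversion step, and the annotation format are placeholders for illustration, not DataGemma's actual interface. Only the Data Commons lookup uses the public Python client.

```python
# Hedged sketch of Retrieval Interleaved Generation (RIG): the model drafts an
# answer plus a natural-language statistical query, and its number is checked
# against Data Commons. Helper functions are stand-ins so the sketch runs.
import datacommons as dc

def model_draft_with_query(question: str) -> dict:
    """Stand-in for the fine-tuned Gemma model. In DataGemma the model itself
    emits both a draft value and a descriptive query; hard-coded here."""
    return {
        "draft_answer": "California has roughly 39 million residents.",
        "draft_value": 39_000_000,
        "nl_query": "What is the population of California?",
    }

def nl_query_to_structured(nl_query: str) -> tuple[str, str]:
    """Stand-in for converting a natural-language query into a structured
    Data Commons lookup (place DCID, statistical variable)."""
    # Hypothetical mapping; a real system would resolve entities and variables.
    return "geoId/06", "Count_Person"

def rig_answer(question: str) -> str:
    draft = model_draft_with_query(question)
    place, stat_var = nl_query_to_structured(draft["nl_query"])
    grounded_value = dc.get_stat_value(place, stat_var)  # retrieve the statistic
    # Interleave the retrieved figure with the model's own number, with a citation.
    return (
        f"{draft['draft_answer']} "
        f"[Data Commons reports {grounded_value:,} for {stat_var} in {place}; "
        f"model draft was {draft['draft_value']:,}.]"
    )

print(rig_answer("How many people live in California?"))
```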
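And a corresponding sketch of the RAG flow, again under assumptions: the query-extraction helper, the prompt layout, and the example statistical variables are illustrative, while the final prompt would be handed to a long-context model such as Gemini 1.5 Pro.

```python
# Hedged sketch of Retrieval-Augmented Generation (RAG): fetch relevant
# statistics from Data Commons first, then combine them with the original
# question as context for a long-context LLM.
import datacommons as dc

def extract_queries(question: str) -> list[tuple[str, str]]:
    """Stand-in for the step that turns the user's question into Data Commons
    lookups (place DCID, statistical variable). Variables are examples."""
    return [("geoId/06", "Count_Person"), ("geoId/06", "UnemploymentRate_Person")]

def build_prompt(question: str) -> str:
    context_lines = []
    for place, stat_var in extract_queries(question):
        series = dc.get_stat_series(place, stat_var)  # date -> value
        latest_date, latest_value = sorted(series.items())[-1]
        context_lines.append(f"{stat_var} for {place} ({latest_date}): {latest_value}")
    context = "\n".join(context_lines)
    # The retrieved statistics plus the original question form the prompt that
    # would be passed to a long-context LLM for answer generation.
    return (
        "Using only the statistics below, answer the question.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

print(build_prompt("How are population and unemployment trending in California?"))
```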
Promising Results in Testing
In preliminary tests involving 101 queries, the DataGemma models fine-tuned with RIG improved factual accuracy from a baseline of 5-17% to approximately 58%. RAG produced slightly lower accuracy but still outperformed the baseline models.
Using RAG, DataGemma answered 24-29% of queries with statistical responses drawn from Data Commons. When it did, the cited numerical values were 99% accurate, though the model drew incorrect inferences from those numbers 6-20% of the time.
Both RIG and RAG improve model accuracy on statistical queries, which makes them particularly useful in research and decision-making contexts. RIG is faster, while RAG supplies more extensive data but depends on the availability of relevant information in Data Commons and on the model's ability to handle long contexts.
Google aims to advance research on these methods through the public release of DataGemma with RIG and RAG.
The company stated, "Our research is ongoing, and we are committed to refining these methodologies as we scale up this work, ensuring rigorous testing, and integrating this enhanced functionality into both Gemma and Gemini models via a phased, limited-access approach."