Leaderboard: OpenAI’s GPT-4 Achieves Lowest Rate of Hallucinations

OpenAI's GPT-4 has emerged as the leading large language model (LLM) in minimizing hallucinations when summarizing documents, according to a recent assessment by Vectara. The company launched a leaderboard on GitHub that benchmarks prominent LLMs using its Hallucination Evaluation Model, which quantifies how often a model hallucinates (generates inaccurate or fabricated information) during document summarization.

GPT-4 and its variant, GPT-4 Turbo, topped the leaderboard with an accuracy of 97% and a hallucination rate of just 3%. GPT-3.5 Turbo followed closely with 96.5% accuracy and a slightly higher hallucination rate of 3.5%.

Among non-OpenAI contenders, Meta's 70-billion-parameter version of Llama 2 distinguished itself, achieving an accuracy of 94.9% and a hallucination rate of only 5.1%. In stark contrast, Google's models performed poorly on the leaderboard: Google PaLM 2 recorded an accuracy of 87.9% with a 12.1% hallucination rate, while its chat-tuned variant dropped significantly, posting just 72.8% accuracy and the highest hallucination rate of 27.2%.

Notably, Google PaLM 2 Chat also generated the longest summaries, averaging 221 words, whereas GPT-4 produced an average of 81 words per summary.

### Evaluation Methodology

Vectara's evaluation, aimed at identifying hallucinations in LLM outputs, used open-source datasets. The company asked each model to summarize 1,000 short documents, with instructions to rely solely on the content of those documents. However, only 831 of the documents were summarized by every model; the rest were refused by at least one model due to content restrictions. Vectara then calculated each model's overall accuracy and hallucination rate over this shared set of documents.
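
To make the arithmetic concrete, the sketch below shows one way to compute per-model accuracy and hallucination rates over the subset of documents that every model summarized. The data structures and judgment values are purely illustrative and are not Vectara's actual pipeline.

```python
# Illustrative sketch (not Vectara's pipeline): compute accuracy and
# hallucination rates for each model over the documents all models summarized.
from typing import Dict

def score_models(
    judgments: Dict[str, Dict[str, bool]],  # model -> {doc_id: summary judged consistent with source?}
) -> Dict[str, Dict[str, float]]:
    # Keep only documents that every model produced a summary for.
    shared_docs = set.intersection(*(set(per_doc) for per_doc in judgments.values()))
    results = {}
    for model, per_doc in judgments.items():
        consistent = sum(1 for doc_id in shared_docs if per_doc[doc_id])
        accuracy = consistent / len(shared_docs)
        results[model] = {
            "accuracy_pct": round(accuracy * 100, 1),
            "hallucination_rate_pct": round((1 - accuracy) * 100, 1),
        }
    return results

# Example with made-up judgments for two models over three shared documents.
example = {
    "model_a": {"d1": True, "d2": True, "d3": False},
    "model_b": {"d1": True, "d2": False, "d3": False},
}
print(score_models(example))
# {'model_a': {'accuracy_pct': 66.7, 'hallucination_rate_pct': 33.3}, ...}
```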

It's important to note that while the tested content was free of illicit and 'not safe for work' material, the presence of certain trigger words led to content restrictions from some models.

### Addressing Hallucination Challenges

The issue of hallucinations has been a significant barrier to the widespread adoption of generative AI within enterprises. Shane Connelly, head of product at Vectara, highlighted in a blog post the historical difficulty in quantifying hallucinations effectively. Previous attempts have often been too abstract or involved controversial subjects, limiting their practical application for businesses.

The Hallucination Evaluation Model created by Vectara is open-source, allowing organizations to use it to assess the reliability of their language models in Retrieval Augmented Generation (RAG) frameworks. This model is available through Hugging Face, enabling users to customize it according to their unique requirements.
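
For teams that want to experiment, the sketch below scores source/summary pairs with the open-source model. The sentence-transformers CrossEncoder interface and the score interpretation (values near 1 indicate the summary is consistent with the source) reflect the model card around the time of the leaderboard's release; verify against the current card on Hugging Face, as the interface may have changed.

```python
# Minimal sketch: scoring source/summary pairs with Vectara's open-source
# hallucination evaluation model. Interface assumed from the model card at
# release; check the current Hugging Face card before relying on it.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

pairs = [
    # (source text, candidate summary)
    ("The company reported revenue of $10M in Q3.",
     "Quarterly revenue came in at $10 million."),
    ("The company reported revenue of $10M in Q3.",
     "The company doubled its revenue to $20M."),
]

# Scores near 1.0 suggest the summary is supported by the source;
# scores near 0.0 suggest a likely hallucination.
scores = model.predict(pairs)
for (source, summary), score in zip(pairs, scores):
    print(f"{score:.2f}  {summary}")
```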

As Connelly articulates, "Our goal is to equip enterprises with the insights necessary to confidently implement generative systems through thorough and quantified analysis." By providing a clearer understanding of AI outputs, businesses can better navigate the nuances of generative AI technology.
