OpenAI's GPT-4 has emerged as the leading large language model (LLM) at minimizing hallucinations when summarizing documents, according to a recent assessment by Vectara. The company launched a leaderboard on GitHub that benchmarks prominent LLMs using its Hallucination Evaluation Model, which quantifies how often a model hallucinates (generates inaccurate or fabricated information) during document summarization.
Both GPT-4 and its variant, GPT-4 Turbo, led the field with the highest accuracy rate of 97% and a hallucination rate of just 3%. GPT-3.5 Turbo followed closely with an accuracy of 96.5% and a slightly higher hallucination rate of 3.5%.
Among non-OpenAI contenders, Meta's 70-billion-parameter Llama 2 distinguished itself, achieving an accuracy score of 94.9% and a hallucination rate of only 5.1%. In stark contrast, Google's models fared poorly on the leaderboard: Google PaLM 2 recorded an accuracy of 87.9% and a 12.1% hallucination rate, while its chat-tuned variant dropped to just 72.8% accuracy and the highest hallucination rate of 27.2%.
Notably, Google PaLM 2 Chat also generated the longest summaries, averaging 221 words each, whereas GPT-4 averaged 81 words per summary.
### Evaluation Methodology
Vectara's evaluation, aimed at identifying hallucinations in LLM outputs, used open-source datasets. The company fed each model 1,000 short documents and asked it to summarize each one using only the content of that document. Only 831 of the documents were summarized by every model; the rest were rejected by at least one model's content filters. Vectara then calculated overall accuracy and hallucination rates on the documents that all models summarized.
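To make the aggregation concrete, here is a minimal sketch of how per-document consistency judgments could be rolled up into the accuracy and hallucination rates reported on the leaderboard. The data layout and the `judge_consistent` helper are hypothetical stand-ins, not Vectara's actual pipeline.

```python
from typing import Callable, Dict


def score_models(
    documents: Dict[str, str],                      # doc_id -> source text
    summaries: Dict[str, Dict[str, str]],           # model -> doc_id -> summary (absent if refused)
    judge_consistent: Callable[[str, str], bool],   # hypothetical consistency judge
) -> Dict[str, Dict[str, float]]:
    """Compute accuracy and hallucination rates over documents every model summarized."""
    # Keep only the documents that every model actually summarized
    # (the leaderboard used 831 of the original 1,000 for this reason).
    shared_ids = [
        doc_id for doc_id in documents
        if all(doc_id in per_model for per_model in summaries.values())
    ]

    results = {}
    for model, per_model in summaries.items():
        consistent = sum(
            judge_consistent(documents[doc_id], per_model[doc_id])
            for doc_id in shared_ids
        )
        accuracy = consistent / len(shared_ids)
        results[model] = {
            "accuracy": round(100 * accuracy, 1),
            "hallucination_rate": round(100 * (1 - accuracy), 1),
        }
    return results
```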
Although the test content contained no illicit or 'not safe for work' material, certain trigger words still caused some models to refuse to produce a summary.
### Addressing Hallucination Challenges
The issue of hallucinations has been a significant barrier to the widespread adoption of generative AI within enterprises. Shane Connelly, head of product at Vectara, highlighted in a blog post the historical difficulty in quantifying hallucinations effectively. Previous attempts have often been too abstract or involved controversial subjects, limiting their practical application for businesses.
The Hallucination Evaluation Model created by Vectara is open source, allowing organizations to use it to assess how reliably their language models stay grounded in Retrieval-Augmented Generation (RAG) pipelines. The model is available through Hugging Face, so users can adapt it to their own requirements.
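As a rough illustration of how a team might try the model on its own RAG outputs, the sketch below scores source/summary pairs for factual consistency. It assumes the Hugging Face repository id `vectara/hallucination_evaluation_model` and the sentence-transformers CrossEncoder interface described on the model card at launch; the current model card may recommend different loading code.

```python
# Minimal sketch: score (source, summary) pairs for factual consistency.
# Assumes the repository id "vectara/hallucination_evaluation_model" and the
# sentence-transformers CrossEncoder interface from the original model card.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

pairs = [
    # (source passage retrieved by the RAG pipeline, generated summary/answer)
    ("The plant was commissioned in 1998 and produces 40 MW.",
     "The plant, built in 1998, has a 40 MW capacity."),
    ("The plant was commissioned in 1998 and produces 40 MW.",
     "The plant, built in 2008, has a 400 MW capacity."),
]

# Scores close to 1.0 indicate the summary is supported by the source;
# scores close to 0.0 indicate a likely hallucination.
scores = model.predict(pairs)
for (source, summary), score in zip(pairs, scores):
    print(f"{score:.3f}  {summary}")
```

The cutoff at which a summary is flagged as hallucinated is left to the user; a stricter threshold trades more false alarms for fewer missed fabrications.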
As Connelly articulates, "Our goal is to equip enterprises with the insights necessary to confidently implement generative systems through thorough and quantified analysis." By providing a clearer understanding of AI outputs, businesses can better navigate the nuances of generative AI technology.