All generative AI models hallucinate, from Google’s Gemini to Anthropic’s Claude to the recent stealth release of OpenAI’s GPT-4o. In other words, they sometimes make things up, with results that range from amusing to genuinely concerning. How often they do so, however, varies from model to model and depends largely on the data sources they’ve been trained on.
A recent study conducted by researchers from Cornell, the University of Washington, the University of Waterloo, and the nonprofit research institute AI2 aimed to evaluate these hallucinations by comparing models like GPT-4o against authoritative sources on various subjects, including law, health, history, and geography. Their findings revealed that no model excelled in every category, with the least hallucinatory models often opting not to answer questions they could get wrong.
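The paper’s grading pipeline is more involved than this, but as a rough illustration of the bookkeeping such an evaluation implies, the sketch below tallies per-topic hallucination and abstention rates from labeled model responses. The record schema, labels, and data here are hypothetical, not the study’s actual format.

```python
from collections import defaultdict

# Each record is one graded model response. "label" is assumed to be assigned
# by checking the answer against an authoritative source:
#   "correct"      - answer agrees with the source
#   "hallucinated" - answer contradicts the source or is unsupported
#   "abstained"    - the model declined to answer
# (Hypothetical schema; the study's actual grading pipeline is more involved.)
responses = [
    {"model": "model-a", "topic": "finance",   "label": "hallucinated"},
    {"model": "model-a", "topic": "geography", "label": "correct"},
    {"model": "model-b", "topic": "finance",   "label": "abstained"},
    {"model": "model-b", "topic": "geography", "label": "correct"},
]

def per_topic_rates(records):
    """Return {(model, topic): (hallucination_rate, abstention_rate)}."""
    counts = defaultdict(lambda: {"correct": 0, "hallucinated": 0, "abstained": 0})
    for r in records:
        counts[(r["model"], r["topic"])][r["label"]] += 1
    rates = {}
    for key, c in counts.items():
        total = sum(c.values())
        rates[key] = (c["hallucinated"] / total, c["abstained"] / total)
    return rates

for (model, topic), (hall, abst) in sorted(per_topic_rates(responses).items()):
    print(f"{model:8s} {topic:10s} hallucination={hall:.0%} abstention={abst:.0%}")
```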
“The crucial insight from our research is that we cannot fully trust the outputs of these models yet,” said Wenting Zhao, a doctoral student at Cornell and co-author of the study. “Currently, even the best models only generate hallucination-free text about 35% of the time.”
Previous academic efforts have also examined the “factuality” of generative AI models, including research by another AI2-affiliated team. Zhao points out that earlier assessments mainly focused on straightforward questions whose answers could be easily found on Wikipedia, which isn’t the most rigorous test since most models are trained on Wikipedia data.
To challenge the models more effectively and to better reflect the types of questions users actually ask, the researchers identified topics across the web that lack Wikipedia references. More than half of the questions in their benchmark could not be answered using Wikipedia, and they span diverse areas such as culture, geography, astronomy, pop culture, finance, medicine, computer science, and celebrities.
In their research, the team evaluated over a dozen popular models, many launched in the past year, including GPT-4o, “open” models like Meta’s Llama 3 70B and Mistral’s Mixtral 8x22B, as well as API-restricted models like Perplexity’s Sonar Large, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Opus.
The results indicated that models are not hallucinating much less than they used to, despite claims to the contrary from major generative AI players like OpenAI and Anthropic. GPT-4o and OpenAI’s older GPT-3.5 performed comparably in terms of how accurate their answers were, with GPT-4o coming out slightly ahead. Overall, OpenAI’s models were the least prone to hallucinations, followed by Mixtral 8x22B, Cohere’s Command R, and Perplexity’s Sonar models.
Questions related to celebrities and finance posed the greatest challenges for the models, while geography and computer science questions were easier, likely due to more extensive training data in these areas. When the source of an answer wasn’t Wikipedia, models struggled significantly more, particularly GPT-3.5 and GPT-4o, illustrating their heavy reliance on Wikipedia content.
Even models capable of searching the web, such as Command R and Perplexity’s Sonar, struggled with the benchmark’s “non-Wiki” questions. Model size didn’t significantly influence hallucination rates either: smaller models such as Anthropic’s Claude 3 Haiku hallucinated roughly as often as larger, supposedly more advanced models such as Claude 3 Opus.
So what does this mean for the future of generative AI, and where are the improvements vendors have promised? While it’s tempting to conclude that vendors exaggerate their progress, it’s also possible that the benchmarks in use don’t adequately reflect current capabilities. As prior research has noted, many AI evaluations lack context and are susceptible to Goodhart’s law, under which a measure stops being informative once it becomes the target being optimized.
Regardless, Zhao anticipates that the hallucination issue will “persist for a long time.” “Our empirical results suggest that despite promising methods to reduce or eliminate hallucinations, the actual improvements are limited,” she said. “Moreover, our analysis highlights that even internet-sourced knowledge can be conflicting, partly due to the human-authored nature of training data that may contain inaccuracies.”
A potential interim solution could involve programming models to decline to answer questions more frequently — akin to advising an overly talkative person to quiet down. In the tests conducted, Claude 3 Haiku responded to only about 72% of questions, opting for abstention in the rest. When adjusted for these abstentions, Claude 3 Haiku emerged as the most factual model, at least in terms of delivering the fewest falsehoods.
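To see why adjusting for abstentions can reorder the leaderboard, consider a toy comparison with made-up numbers (not figures from the study): a model that answers everything can post higher raw accuracy while still delivering far more outright falsehoods than a model that frequently abstains but is rarely wrong when it does answer.

```python
# Toy illustration (made-up numbers, not figures from the study) of why
# adjusting for abstentions can reorder models. "Raw accuracy" credits only
# correct answers over all questions; "conditional accuracy" scores only the
# questions a model actually attempted.

models = {
    # name: (questions_answered, answered_correctly, total_questions)
    "answers-everything": (100, 70, 100),  # never abstains, wrong 30% of the time
    "cautious-model":     (72, 65, 100),   # abstains on 28%, rarely wrong when it answers
}

for name, (answered, correct, total) in models.items():
    raw = correct / total            # correct answers over all questions
    conditional = correct / answered # accuracy among attempted questions
    falsehoods = answered - correct  # outright wrong answers delivered to the user
    print(f"{name:20s} raw={raw:.0%} conditional={conditional:.0%} falsehoods={falsehoods}")
```

In this made-up example, the cautious model scores lower on raw accuracy but far higher on conditional accuracy, and it hands the user far fewer falsehoods, which is the sense in which Claude 3 Haiku came out on top.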
But will users gravitate toward a model that leaves many questions unanswered? Zhao believes not and emphasizes that vendors should dedicate more resources to research aimed at reducing hallucinations. While completely eliminating hallucinations may be unrealistic, improvements can be made through human-in-the-loop fact-checking and citation practices during model development.
“Policies and regulations must be established to involve human experts in verifying and validating the information provided by generative AI models,” Zhao concluded. “The field still has substantial opportunities for advancement, including creating robust fact-checking tools, ensuring citations for factual information, and providing corrections for inaccuracies.”