Researchers have identified a significant drawback in the evolution of advanced chatbots. While AI models become more accurate over time, they also grow more willing to answer questions outside their expertise rather than admit uncertainty. Users then take these confident yet incorrect answers at face value, perpetuating a cycle of misinformation. “They are answering almost everything these days,” says José Hernández-Orallo, a professor at the Universitat Politècnica de València, Spain. “That means more correct answers, but also more incorrect ones.”
Hernández-Orallo, who led the study with colleagues at the Valencian Research Institute for Artificial Intelligence, explored three families of large language models (LLMs): OpenAI’s GPT series, Meta’s LLaMA, and the open-source BLOOM. The team studied a range of models within each family, from the relatively basic GPT-3 ada through the more advanced GPT-4, released in March 2023. Notably, the latest versions, GPT-4o and o1-preview, were not included in the analysis.
The researchers assessed each model with thousands of questions across various topics including arithmetic, geography, and science, as well as tasks like alphabetizing lists. They categorized prompts by their perceived difficulty. The results revealed that as the models advanced, the frequency of incorrect answers increased, indicating that more sophisticated chatbots resemble overconfident professors who believe they hold the answers to every query.
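The study’s real harness is far more elaborate, but its bookkeeping is simple to sketch. In the toy Python below, the `ask_model` callable, the sample questions, and the exact-match grading are illustrative assumptions rather than the researchers’ code; the point is the three-way tally of correct, incorrect, and avoidant replies per difficulty bin, which is what exposes wrong answers crowding out honest refusals as models scale.

```python
from collections import Counter

def evaluate(ask_model, benchmark):
    """Tally correct, incorrect, and avoidant replies per difficulty bin.

    ask_model: callable mapping a question string to the model's reply.
    benchmark: (question, reference_answer, difficulty) tuples, with
               difficulty assigned in advance, as in the study.
    """
    tallies = {}
    for question, reference, difficulty in benchmark:
        reply = ask_model(question)
        if reply is None or "don't know" in reply.lower():
            outcome = "avoidant"    # the model declined to answer
        elif reply.strip().lower() == reference.lower():
            outcome = "correct"
        else:
            outcome = "incorrect"   # confident but wrong
        tallies.setdefault(difficulty, Counter())[outcome] += 1
    return tallies

# Toy run with an invented model that, like the systems studied, never declines.
answers = {"What is 7 * 8?": "56",
           "What is the capital of Australia?": "Sydney",
           "What is the smallest prime greater than 100?": "103"}
benchmark = [("What is 7 * 8?", "56", "easy"),
             ("What is the capital of Australia?", "Canberra", "easy"),
             ("What is the smallest prime greater than 100?", "101", "hard")]
print(evaluate(answers.get, benchmark))
# {'easy': Counter({'correct': 1, 'incorrect': 1}), 'hard': Counter({'incorrect': 1})}
```

Run on these toy questions, the overconfident stand-in never lands in the avoidant bucket, which is the pattern the team observed growing stronger in the most advanced systems.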
Human interaction further complicates the problem. Volunteers tasked with evaluating the accuracy of the AI outputs often misclassified wrong answers as correct, with error rates ranging from 10 to 40 percent. Hernández-Orallo concluded, “Humans are not able to supervise these models effectively.”
To mitigate this issue, the research team suggests that AI developers focus on enhancing performance on easier tasks and program chatbots to decline more complex inquiries. “We need people to recognize: ‘I can use it in this area, and I shouldn’t use it in that area,’” Hernández-Orallo added.
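What might such a guardrail look like in practice? One simple shape is a wrapper that scores each question’s difficulty before answering and refuses anything above a cutoff. In the sketch below, the `estimate_difficulty` scorer is a hypothetical placeholder (a real system might use a calibrated classifier or the model’s own uncertainty signals); the refusal logic is the whole idea.

```python
def guarded_answer(question, ask_model, estimate_difficulty, cutoff=0.6):
    """Answer only questions judged easy enough; decline the rest.

    estimate_difficulty is assumed to return a score from 0.0 (trivial)
    to 1.0 (very hard). Building a trustworthy estimator is exactly the
    hard part the researchers leave open.
    """
    if estimate_difficulty(question) > cutoff:
        return "I am not confident I can answer that correctly."
    return ask_model(question)

# Toy demo: a crude stand-in estimator that treats longer questions as harder.
ask_model = lambda q: "56" if q == "What is 7 * 8?" else "(some answer)"
difficulty_by_length = lambda q: min(len(q) / 80, 1.0)

print(guarded_answer("What is 7 * 8?", ask_model, difficulty_by_length))
# -> 56
print(guarded_answer(
    "Derive the asymptotic distribution of the maximum-likelihood estimator.",
    ask_model, difficulty_by_length))
# -> I am not confident I can answer that correctly.
```

The cutoff makes the trade-off explicit: lower it and the chatbot looks less capable but errs less often, which is precisely the tension the next paragraph turns to.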
While this is a prudent suggestion, there may be little incentive for AI companies to adopt it. Chatbots that frequently admit to not knowing answers might seem less advanced or valuable, resulting in decreased usage and revenue for the developers. Consequently, we continue to see disclaimers indicating that “ChatGPT can make mistakes” or that “Gemini may display inaccurate information.”
Ultimately, it becomes our responsibility to scrutinize and verify the answers provided by chatbots to avoid disseminating incorrect information that could cause harm. For the sake of accuracy, always fact-check your chatbot’s responses.