In recent years, artificial intelligence (AI) has made remarkable strides across various fields, notably through large language models (LLMs) that generate human-like text and, in some tasks, surpass human performance. However, researchers have raised concerns about the reasoning capabilities of LLMs, revealing that these models can make errors in simple mathematical problems when slight modifications are introduced. This suggests they may not possess genuine logical reasoning skills.
On Thursday, a team of researchers from Apple published a paper titled “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” which shows how easily LLMs are thrown off when solving math problems. The researchers made small alterations to the problems, such as adding irrelevant information, to evaluate the models' reasoning abilities. Performance dropped significantly under these changes.
For example, a model given a straightforward math question (“Oliver picked 44 kiwis on Friday, 58 on Saturday, and on Sunday, he harvested twice as many as on Friday. How many kiwis did Oliver pick in total?”) calculated the answer correctly. However, once the researchers introduced an irrelevant detail, “On Sunday, he picked twice as many as on Friday, with 5 being smaller than average,” the model produced an erroneous response. In this instance, OpenAI’s o1-mini answered: “…On Sunday, 5 kiwis were smaller than average. We need to subtract them from Sunday’s total: 88 (Sunday’s kiwis) – 5 (smaller kiwis) = 83 kiwis.” The size of a kiwi has no bearing on how many were picked, so the added detail should not change the total.
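The arithmetic behind the kiwi example can be checked in a few lines. The following is only an illustrative sketch: the variable names are hypothetical, and the flawed total assumes the model went on to add up the three days as usual after the subtraction it describes in the quoted step.

```python
# Kiwi counts taken from the example problem quoted above.
friday = 44
saturday = 58
sunday = 2 * friday  # "twice as many as on Friday"

# Distractor from the modified problem: five kiwis were smaller than average.
# Size does not affect the count, so this value should be ignored.
smaller_than_average = 5

# Correct reasoning: add up the three days.
correct_total = friday + saturday + sunday

# Flawed step from the quoted response: subtract the smaller kiwis from Sunday.
flawed_sunday = sunday - smaller_than_average
# Assumption: the remaining days are then summed unchanged.
flawed_total = friday + saturday + flawed_sunday

print("Correct total:", correct_total)
print("Total after the flawed subtraction:", flawed_total)
```

The two totals differ by exactly the number of smaller kiwis, which shows that the error comes from treating the irrelevant clause as an operation to perform, not from a slip in the arithmetic itself.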
This example reflects a broader trend: the researchers modified hundreds of problems, and nearly all of the modifications led to a significant decline in the models' accuracy. They concluded that LLMs do not genuinely comprehend mathematical queries but instead predict responses based on patterns found in their training data. When true reasoning is required, such as recognizing that the smaller kiwis still count toward the total, the models produce perplexing and nonsensical results.
This discovery carries significant implications for AI development. While LLMs demonstrate excellence in many areas, their reasoning abilities are limited. Going forward, researchers must explore ways to enhance LLMs’ reasoning capabilities, enabling them to better understand and solve complex problems.