Researchers Challenge AI's "Reasoning" Skills: Models Struggle with Simple Math Problems Due to Minor Changes

How do machine learning models function, and do they "think" or "reason" as humans do? This question straddles both philosophical and practical realms. A newly circulated paper suggests a clear answer, at least for now: “no”.

A team of AI researchers from Apple has published a paper titled “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” which has sparked discussion since its release on Thursday. While the intricate details delve into symbolic learning and pattern recognition, the core findings are quite straightforward.

Consider this simple math problem:

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he picked on Friday. How many kiwis does Oliver have?

The answer is simple: 44 + 58 + (44 * 2) = 190. Although large language models (LLMs) occasionally struggle with arithmetic, they typically handle straightforward questions like this well.
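If you want to double-check the sum, a few lines of Python will do it; nothing here is assumed beyond the problem statement itself:

```python
# Kiwi totals: Friday + Saturday + Sunday (double Friday's haul).
friday, saturday = 44, 58
sunday = 2 * friday          # "double the number of kiwis he picked on Friday"
total = friday + saturday + sunday
print(total)                 # -> 190
```

But what happens when we introduce an irrelevant detail?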

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

You might think it's the same math problem. A child would understand that a small kiwi is still a kiwi. Yet this additional information trips up even the most advanced LLMs. For instance, here's how OpenAI's o1-mini responds:

“On Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday’s kiwis) – 5 (smaller kiwis) = 83 kiwis.”
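This kind of probe is easy to reproduce. Below is a minimal sketch using OpenAI's Python client; the model name is a stand-in for whichever chat model you have access to, and exact responses will vary from run to run:

```python
# Sketch: ask a model the perturbed kiwi problem and inspect its answer.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

question = (
    "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday, "
    "but five of them were a bit smaller than average. "
    "How many kiwis does Oliver have?"
)

response = client.chat.completions.create(
    model="o1-mini",  # stand-in; any chat model works for this probe
    messages=[{"role": "user", "content": question}],
)
print(response.choices[0].message.content)  # the correct total is 190
```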

This is just one of many questions the researchers modified slightly; across the board, the modifications led to significant drops in the models' success rates.

Why is this the case? Why would a model that seems to understand a problem get thrown off by a minor, irrelevant detail? The researchers argue that this reliable failure mode suggests the models don't truly comprehend the problem. Their training data may let them produce the right answer in some situations, but the moment even the slightest actual “reasoning” is required, such as deciding whether to count smaller kiwis, they start producing strange, wrong results.

The researchers state in their paper: “We investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize this decline arises because current LLMs lack genuine logical reasoning; they merely attempt to replicate the reasoning steps found in their training data.”
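The benchmark behind this finding makes such perturbations systematic: grade-school problems become templates whose names and numbers can be resampled, optionally with an irrelevant clause attached. Here is a minimal sketch of the idea; the template and distractor strings are illustrative, not the paper's actual ones:

```python
# Sketch: generate perturbed variants of a word problem from a template.
# The template and distractor text are illustrative, not the paper's own.
import random

TEMPLATE = (
    "{name} picks {fri} kiwis on Friday. Then he picks {sat} kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he picked on Friday."
    "{distractor} How many kiwis does {name} have?"
)
DISTRACTOR = " But {n} of them were a bit smaller than average."

def make_variant(add_distractor: bool) -> tuple[str, int]:
    """Return (question, correct answer); the distractor never changes the answer."""
    name = random.choice(["Oliver", "Liam", "Noah"])
    fri, sat = random.randint(30, 60), random.randint(30, 60)
    extra = DISTRACTOR.format(n=random.randint(2, 9)) if add_distractor else ""
    question = TEMPLATE.format(name=name, fri=fri, sat=sat, distractor=extra)
    return question, fri + sat + 2 * fri

question, answer = make_variant(add_distractor=True)
print(question, "->", answer)
```

Scoring a model is then just a matter of comparing its final number against the computed answer across many sampled variants.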

This observation aligns with attributes commonly ascribed to LLMs. When “I love you” is statistically followed by “I love you, too,” the model can easily mimic that response—but it doesn’t imply it truly feels love. While it can navigate complex reasoning chains previously encountered, this ability falters with even minor deviations, suggesting it replicates observed patterns rather than genuinely reasons.

Co-author Mehrdad Farajtabar gives a good summary of the paper in a thread on X.

An OpenAI researcher expressed respect for the work of lead author Iman Mirzadeh and his colleagues but questioned their conclusions, arguing that correct responses could be achieved through thoughtful prompt engineering. Farajtabar, responding with the collegiality typical of researchers, noted that better prompting might handle simple deviations but could require significantly more context to counter complex distractions, ones a child would trivially point out.
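For the record, the workaround under debate looks something like prepending a warning to the question. The preamble wording below is illustrative, not quoted from the paper or the exchange:

```python
# Sketch: the kind of prompt-engineering patch being debated.
# The preamble wording is illustrative, not from the paper or the X thread.
PREAMBLE = (
    "Solve the word problem below. Some statements may be irrelevant to the "
    "quantity asked for; identify and ignore them before computing.\n\n"
)

question = (
    "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday, "
    "but five of them were a bit smaller than average. "
    "How many kiwis does Oliver have?"
)

prompt = PREAMBLE + question  # send this instead of the bare question
print(prompt)
```

Farajtabar's counterpoint is that patches like this may not scale to distractions the prompt writer didn't anticipate.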

So, do LLMs reason? Perhaps. Can they reason? No one knows for sure. These concepts aren't clearly defined, and the questions sit at the bleeding edge of AI research, where the state of the art changes daily. It's possible LLMs “reason” in ways we don't yet understand or control.

This topic presents a fascinating area of research, but it also raises important questions about how AI is marketed. Can these systems deliver on their promises? And if they can, how exactly do they achieve this? As AI becomes an integral part of daily software, these inquiries transition from academic discussions to real-world considerations.
