Large language models (LLMs) have demonstrated impressive capabilities in reasoning and problem-solving. However, their reasoning processes and limitations remain topics of debate.
In a recent study, researchers at the University of California, Los Angeles, and Amazon comprehensively evaluated the deductive and inductive reasoning abilities of LLMs. Their findings reveal that while LLMs excel at inferring task rules from solved examples, they struggle to follow explicitly stated instructions. This insight is crucial for understanding how LLMs can be applied to reasoning-intensive tasks.
Understanding Reasoning Types: Deductive vs. Inductive
Reasoning is commonly divided into two main categories: deductive and inductive. Deductive reasoning, often described as “top-down” logic, starts from a general principle and applies it to reach specific conclusions. For example, given the formula F = (9/5)C + 32, you can deduce the Fahrenheit equivalent of any particular Celsius temperature.
In contrast, inductive reasoning employs a “bottom-up” approach. It involves examining specific instances and drawing broader conclusions or patterns, such as deriving a temperature conversion formula from observed Celsius and Fahrenheit values.
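To make the distinction concrete, here is a minimal sketch of the temperature-conversion example framed both ways; the code is illustrative only and not drawn from the study:

```python
def deductive_convert(celsius: float) -> float:
    """Deductive: apply the known rule F = (9/5) * C + 32 to a specific input."""
    return 9 / 5 * celsius + 32


def inductive_fit(examples: list[tuple[float, float]]) -> tuple[float, float]:
    """Inductive: recover the slope and intercept from observed (C, F) pairs.

    Assumes the relationship is linear and the observations are exact.
    """
    (c1, f1), (c2, f2) = examples[0], examples[-1]
    slope = (f2 - f1) / (c2 - c1)
    intercept = f1 - slope * c1
    return slope, intercept


print(deductive_convert(100.0))                        # 212.0
print(inductive_fit([(0.0, 32.0), (100.0, 212.0)]))    # (1.8, 32.0)
```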
Both types of reasoning are essential to intelligence and draw on different cognitive processes. Most existing evaluations of LLMs' reasoning abilities, however, do not clearly distinguish between their inductive and deductive skills.
A New Framework for Evaluating LLM Reasoning
The study introduced a series of experiments designed to systematically assess LLMs’ inductive and deductive reasoning capabilities. By framing the same underlying task to require either inductive or deductive reasoning, the researchers could compare the two abilities directly and consistently.
For instance, during arithmetic tasks, LLMs were tested on their ability to apply a given mathematical function (deductive reasoning) versus inferring that function from input-output examples (inductive reasoning).
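As an illustration, the two framings of such a task might look like the following; these prompts are hypothetical and not the study's exact wording:

```python
# Hypothetical prompt framings for the same arithmetic task.

deductive_prompt = (
    "You are given the rule f(x, y) = (x + y) mod 7.\n"
    "Apply it to the input (5, 4) and return only the numeric result."
)

inductive_prompt = (
    "Here are solved examples of an unknown function f:\n"
    "  f(1, 2) = 3\n  f(3, 5) = 1\n  f(6, 6) = 5\n"
    "Infer the rule and express f as a Python function."
)
```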
To further clarify these reasoning processes, the researchers developed SolverLearner, a two-step framework aimed at isolating and assessing the inductive reasoning capabilities of LLMs.
In the first step, SolverLearner prompts the LLM to generate a function mapping input data points to corresponding output values based solely on provided examples, focusing on the LLM’s pattern recognition abilities.
In the second step, an external code interpreter executes the proposed function on new test data. This separation prevents deductive reasoning from influencing the assessment of inductive reasoning.
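The outline below sketches this two-step setup under simplifying assumptions: query_llm is a stand-in for whatever chat-completion API is used, and the returned code is executed directly for brevity. It is not the paper's implementation.

```python
def query_llm(prompt: str) -> str:
    """Stand-in for an LLM call; a real pipeline would hit a chat-completion API.

    Here it returns a canned answer so the sketch runs end to end.
    """
    return "def f(x):\n    return 2 * x + 1"


def learn_function(examples: list[tuple[int, int]]) -> str:
    """Step 1: ask the model to propose a rule, given only solved examples."""
    shown = "\n".join(f"f({x}) = {y}" for x, y in examples)
    prompt = (
        "Infer the rule behind these examples and return a Python function "
        f"named f(x):\n{shown}"
    )
    return query_llm(prompt)


def evaluate_function(code: str, test_cases: list[tuple[int, int]]) -> float:
    """Step 2: an interpreter, not the LLM, runs the proposed rule on held-out inputs."""
    namespace: dict = {}
    exec(code, namespace)  # assumes the generated code is trusted; sandbox in practice
    f = namespace["f"]
    return sum(f(x) == y for x, y in test_cases) / len(test_cases)


code = learn_function([(0, 1), (1, 3), (2, 5)])
print(evaluate_function(code, [(3, 7), (10, 21)]))  # 1.0 if the proposed rule generalizes
```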
Contrasting Strengths of LLMs in Reasoning
Using SolverLearner, the researchers evaluated the inductive and deductive reasoning capabilities of GPT-3.5 and GPT-4 across various tasks, including syntactic reasoning, arithmetic, and spatial reasoning.
Results indicated that both models displayed exceptional inductive reasoning, achieving near-perfect accuracy on tasks that required learning a rule from examples. However, they struggled to apply explicitly given rules, especially in unconventional or counterfactual settings. For instance, while they performed well on base-10 arithmetic, they struggled with non-standard numerical bases such as 9 and 11.
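To see why a counterfactual base is harder than it looks, the following snippet (illustrative only) adds the same two numerals under different bases; the digits are identical, but the carrying rules and the resulting value change:

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def add_in_base(a: str, b: str, base: int) -> str:
    """Add two numerals written in `base` and return the sum in that same base."""
    total = int(a, base) + int(b, base)
    out = []
    while total:
        total, d = divmod(total, base)
        out.append(DIGITS[d])
    return "".join(reversed(out)) or "0"


print(add_in_base("57", "26", 10))  # '83'  -- the familiar base-10 result
print(add_in_base("57", "26", 9))   # '84'  -- same digits, different carries
print(add_in_base("57", "26", 11))  # '82'
```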
These findings suggest that LLMs are generally more adept at learning from examples and identifying patterns than at following explicit instructions. This has significant implications for real-world use: while LLMs may appear capable of following logical directives, their performance degrades when a task deviates from the patterns seen in their training data.
SolverLearner offers a way to verify that the rule a model proposes correctly maps inputs to outputs, although its effectiveness depends on having a verification mechanism, such as a code interpreter, to execute and check the proposed rule.
This study highlights the complexities and limitations of LLMs, illustrating that our understanding of these sophisticated models remains incomplete.