Large Language Models (LLMs) have shown promise in tackling planning and reasoning tasks by exploring multiple candidate solutions. Nonetheless, current methods can be slow, computationally expensive, and prone to unreliable answers.
To address these challenges, researchers from Cornell University and IBM Research developed AutoToS, a technique that combines the planning capabilities of LLMs with the speed and reliability of rule-based search algorithms. AutoToS minimizes human intervention and significantly reduces the computational cost of solving planning problems, making it a practical option for LLM applications that require reasoned decision-making over large solution spaces.
Innovative Techniques for Planning
Interest in using LLMs for planning problems has surged, leading to the creation of various methods. Among the most effective, Tree of Thoughts uses the LLM itself as the search mechanism, validating candidate solutions and proposing corrections. However, these techniques face two critical challenges: a high volume of LLM calls, which can be costly, and no guarantees of “completeness” and “soundness.” Completeness ensures that a solution will eventually be found if one exists, while soundness guarantees that any solution returned is actually valid.
Thought of Search (ToS) proposes an alternative by having LLMs generate code for two pivotal components of a search algorithm: the successor function, which generates the candidate next states reachable from a given state, and the goal function, which checks whether a state satisfies the problem’s goal. This shifts the LLM’s role from performing the search to writing the code that performs it, sharply reducing LLM involvement during the search process.
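To make this concrete, here is a minimal sketch of what such LLM-generated components might look like for the 24 Game, one of the benchmarks discussed later, in which four numbers must be combined arithmetically to reach 24. The state representation and function names here are illustrative assumptions, not the authors’ actual generated code:

```python
from fractions import Fraction
from itertools import combinations

# Illustrative only: a plausible shape for LLM-generated search components
# for the 24 Game. A state is a sorted tuple of the numbers still in play,
# held as exact Fractions so that division never loses precision.

def successors(state):
    """Return every state reachable by combining two numbers with +, -, *, /."""
    results = []
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [state[k] for k in range(len(state)) if k not in (i, j)]
        candidates = {a + b, a * b, a - b, b - a}
        if b != 0:
            candidates.add(a / b)
        if a != 0:
            candidates.add(b / a)
        for value in candidates:
            results.append(tuple(sorted(rest + [value])))
    return results

def is_goal(state):
    """The goal: a single remaining number equal to 24."""
    return len(state) == 1 and state[0] == Fraction(24)
```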
Michael Katz, a principal research staff member at IBM Research, explains, “Historically, the planning community either manually coded these components for new problems or generated them from planning language descriptions, which were either hand-coded or learned from data. We aimed to use large language models to generate code for search components from textual problem descriptions.”
The original ToS technique yielded promising advancements in the soundness and completeness of search algorithms but required human experts for feedback on the generated code, creating a bottleneck that hampered the algorithm’s speed.
Automating the Process with AutoToS
To tackle this limitation, AutoToS automates the feedback and debugging process using unit tests and debugging statements, combined with few-shot and chain-of-thought (CoT) prompting techniques.
AutoToS operates in several steps. First, it supplies the LLM with a problem description and prompts it to generate code for the successor and goal functions. Next, unit tests exercise the goal function, and any failures are fed back to the model for revision. Once the goal function passes, the algorithm runs a limited breadth-first search to check the soundness and completeness of the successor function, iterating until both functions meet all criteria. Finally, the validated functions are plugged into a classic search algorithm, which executes the full search efficiently without further LLM calls.
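A stripped-down sketch of that loop might look like the following. The `llm_generate` function is a hypothetical placeholder for a call to any LLM API, and the two checkers are simplified stand-ins for the fuller battery of soundness and completeness tests described above:

```python
from collections import deque

def llm_generate(prompt):
    """Placeholder: send `prompt` to a language model, return Python source."""
    raise NotImplementedError("wire up an LLM client here")

def compile_component(source, name):
    """Execute the generated source and pull out the named function."""
    namespace = {}
    exec(source, namespace)  # in practice, sandbox untrusted generated code
    return namespace[name]

def goal_test_feedback(is_goal, goal_states, non_goal_states):
    """Unit-test the goal function; return human-readable failure messages."""
    feedback = [f"is_goal must accept goal state {s}"
                for s in goal_states if not is_goal(s)]
    feedback += [f"is_goal must reject non-goal state {s}"
                 for s in non_goal_states if is_goal(s)]
    return feedback

def bounded_bfs_feedback(successors, is_goal, start, max_states=10_000):
    """Run a limited BFS; complain if no goal is reachable within the budget."""
    seen, queue = {start}, deque([start])
    while queue and len(seen) < max_states:
        state = queue.popleft()
        if is_goal(state):
            return []
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return [f"limited BFS from {start} reached no goal; "
            "the successor function may be missing transitions"]

def auto_tos(problem_description, goal_states, non_goal_states, start,
             max_rounds=5):
    """Generate, test, and repair search components with LLM feedback rounds."""
    prompt = problem_description
    for _ in range(max_rounds):
        source = llm_generate(prompt)
        is_goal = compile_component(source, "is_goal")
        successors = compile_component(source, "successors")
        feedback = goal_test_feedback(is_goal, goal_states, non_goal_states)
        if not feedback:
            feedback = bounded_bfs_feedback(successors, is_goal, start)
        if not feedback:
            return successors, is_goal  # both components passed every check
        prompt = problem_description + "\nFix these issues:\n" + "\n".join(feedback)
    raise RuntimeError("failed to obtain valid search components")
```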
Evaluation of AutoToS
The researchers assessed AutoToS across several planning and reasoning tasks, including BlocksWorld, Mini Crossword, and the 24 Game, in which four integers must be combined with arithmetic operations to make 24 (for example, 4, 7, 8, 8 can yield (7 − 8 ÷ 8) × 4 = 24). They used a range of LLMs, including GPT-4o, Llama 2, and DeepSeek Coder, to analyze how performance varies with model size.
Their findings showed that AutoToS enabled all models to identify and fix errors in their code using the automated feedback. Larger models generally produced correct goal functions without feedback and needed only a few iterations to refine the successor function. Notably, GPT-4o-mini achieved strong accuracy despite its smaller size.
The researchers noted, “With just a few calls to the language model, we demonstrate that we can obtain the search components without direct human feedback, ensuring soundness, completeness, and nearly 100% accuracy across all models and domains.” AutoToS drastically reduces the number of LLM calls compared with previous approaches; for example, solving the 1,362 puzzles in the 24 Game dataset took roughly 100,000 calls to GPT-4 with earlier methods, whereas AutoToS needed only 2.2 calls on average.
Katz remarked, “With these components, we can employ the standard BFS algorithm to solve all 1,362 games in under 2 seconds with complete accuracy, something previous methods could not achieve.”
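That final step is a textbook breadth-first search. Reusing the illustrative `successors` and `is_goal` functions sketched earlier (again, an assumption-laden sketch rather than IBM’s implementation), solving a single 24 Game instance looks like this:

```python
from collections import deque
from fractions import Fraction

def bfs(start, successors, is_goal):
    """Plain breadth-first search returning the sequence of states to a goal."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if is_goal(state):
            return path
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None  # with a complete successor function, this means no solution exists

# One instance of the 24 Game, e.g. the numbers 4, 7, 8, 8.
start = tuple(sorted(Fraction(n) for n in (4, 7, 8, 8)))
print(bfs(start, successors, is_goal))  # a path of states ending in (Fraction(24, 1),)
```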
Implications for Enterprise Applications
AutoToS holds significant potential for enterprise contexts requiring planning solutions. By reducing LLM usage costs and reliance on manual input, it allows experts to focus on high-level planning and goal specifications.
Katz emphasizes, “We hope AutoToS will enhance both the development and deployment of planning-based solutions, using language models to create verifiable search components and speeding up development while circumventing issues typical with LLM deployment.”
ToS and AutoToS exemplify neuro-symbolic AI, a hybrid approach that merges deep learning and rule-based systems to tackle complex challenges. This approach is increasingly recognized as an effective direction to address the shortcomings of current AI systems.
“I have no doubt about the future role of hybrid systems in AI,” stated Harsha Kokel, research scientist at IBM. “Current language models can be viewed as hybrid systems since they perform search to determine the next tokens.”
While ToS and AutoToS show considerable promise, further exploration remains essential.
“It’s exciting to witness how planning with natural language evolves, and how LLMs can enhance the integration of planning tools in decision-making processes, paving the way for future intelligent agents,” Kokel and Katz concluded. “We are eager to explore how the world knowledge of LLMs can enrich planning and action in real-world situations.”