AI agents are emerging as a significant area of research, offering numerous real-world applications. These agents leverage foundation models—including large language models (LLMs) and vision-language models (VLMs)—to interpret natural language instructions and autonomously pursue complex objectives. By utilizing tools such as browsers, search engines, and code compilers, AI agents can verify their actions and reason about their goals.
However, a recent analysis from Princeton University highlights key shortcomings in current benchmark and evaluation practices for AI agents, which undermine their potential for practical use.
Challenges in Agent Benchmarking
The researchers emphasize that traditional benchmarking approaches designed for foundation models do not carry over to AI agents. A notable issue is the lack of cost control in agent evaluations. Running an AI agent can be far more costly than making a single model call, because agents rely on stochastic language models that can return different results for the same input and therefore typically issue many model calls per task.
To improve accuracy, many AI systems generate multiple responses and use mechanisms such as voting or external verification tools to select the best answer. Sampling hundreds or thousands of responses can significantly boost an agent's accuracy, but it incurs hefty computational costs. In research settings, accuracy is often prioritized regardless of cost; practical applications, however, operate under a strict budget for each query, which makes it vital to build cost controls into agent evaluations. Without them, researchers are likely to develop extremely expensive agents simply to top the leaderboard. The Princeton team recommends visualizing evaluation results as a Pareto curve of accuracy versus inference cost and using techniques that jointly optimize the two metrics.
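To make the recommended Pareto view concrete, the sketch below computes such a frontier from a set of (accuracy, cost) measurements. It is a minimal illustration: the configuration names and numbers are hypothetical, not results from the study.

```python
# Minimal sketch: accuracy-vs-cost Pareto frontier over agent configurations.
# The configurations and numbers below are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class EvalResult:
    name: str
    accuracy: float   # benchmark accuracy, 0-1
    cost_usd: float   # average inference cost per query, in dollars

results = [
    EvalResult("single call", 0.58, 0.002),
    EvalResult("5-sample majority vote", 0.66, 0.010),
    EvalResult("10-sample + verifier", 0.64, 0.050),   # dominated: pricier and less accurate
    EvalResult("100 samples + verifier", 0.68, 0.210),
]

def pareto_frontier(points: list[EvalResult]) -> list[EvalResult]:
    """Keep only configurations not dominated by a cheaper, at-least-as-accurate one."""
    frontier: list[EvalResult] = []
    for p in sorted(points, key=lambda r: (r.cost_usd, -r.accuracy)):
        if not frontier or p.accuracy > frontier[-1].accuracy:
            frontier.append(p)
    return frontier

for r in pareto_frontier(results):
    print(f"{r.name}: accuracy={r.accuracy:.2f}, cost=${r.cost_usd:.3f}/query")
```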
Their analysis of various prompting techniques and agentic patterns across different studies reveals a staggering discrepancy: “For substantially similar accuracy, the cost can differ by almost two orders of magnitude,” they state. “Yet, the cost of running these agents isn’t a top-line metric reported in any of these papers.”
The researchers argue that optimizing for both metrics can produce agents that cost far less while maintaining accuracy. For instance, developers can spend more on the fixed, one-time cost of optimizing an agent's design in exchange for a lower variable cost per query, such as by using fewer in-context learning examples.
In their tests on HotpotQA, a benchmark for question-answering tasks, they demonstrate that joint optimization leads to an ideal balance between accuracy and inference costs.
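What jointly optimizing the two metrics might look like in practice can be sketched with two simple selection rules over measured results; the configurations, numbers, budget, and tolerance below are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative selection rules over jointly measured (accuracy, cost) results.
# Configuration names and numbers are hypothetical, for illustration only.
measurements = {
    # config name: (accuracy, average cost in USD per query)
    "0-shot, 1 sample":   (0.55, 0.001),
    "2-shot, 3 samples":  (0.63, 0.006),
    "8-shot, 10 samples": (0.65, 0.045),
}

def best_under_budget(results: dict, max_cost: float) -> str:
    """Most accurate configuration whose per-query cost stays within budget."""
    eligible = {k: v for k, v in results.items() if v[1] <= max_cost}
    return max(eligible, key=lambda k: eligible[k][0])

def cheapest_near_best(results: dict, tolerance: float = 0.02) -> str:
    """Cheapest configuration within `tolerance` accuracy of the best one."""
    top = max(acc for acc, _ in results.values())
    near = {k: v for k, v in results.items() if v[0] >= top - tolerance}
    return min(near, key=lambda k: near[k][1])

print(best_under_budget(measurements, max_cost=0.01))    # "2-shot, 3 samples"
print(cheapest_near_best(measurements, tolerance=0.02))  # "2-shot, 3 samples"
```

Either rule treats cost as a first-class metric of the evaluation rather than an afterthought.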
“Useful agent evaluations must control for cost—even if our primary goal is to identify innovative agent designs,” the researchers assert. “Focusing solely on accuracy fails to reveal genuine progress, as it can be artificially inflated through scientifically trivial methods like retrying.”
From Research to Real-World Applications
Another challenge identified is the distinction between evaluating models for research and developing them for real-world applications. In research environments, accuracy often takes precedence, while inference costs can be neglected. However, when building AI agent applications, these costs significantly influence the selection of models and techniques.
Assessing the inference cost of AI agents is itself a challenge: different model providers can charge different amounts for similar services, and API prices change over time and may depend on factors such as bulk-call pricing.
To tackle this concern, the researchers have launched a website that adjusts model comparisons according to token pricing.
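The idea behind such pricing-adjusted comparisons can be sketched with a simple cost calculation; the model names and per-token prices below are placeholders to be replaced with a provider's current rates, not actual figures.

```python
# Sketch of a token-pricing-adjusted cost estimate. Prices are in dollars per
# million tokens and are placeholders; plug in a provider's current rates.
PRICES = {
    # model name: (input $/1M tokens, output $/1M tokens) -- hypothetical values
    "model-a": (0.50, 1.50),
    "model-b": (5.00, 15.00),
}

def query_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one API call under the configured token prices."""
    in_price, out_price = PRICES[model]
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000

def run_cost(model: str, calls: list[tuple[int, int]]) -> float:
    """Total cost of an agent run made up of several (prompt, completion) calls."""
    return sum(query_cost(model, p, c) for p, c in calls)

# An agent that retries and samples makes many calls per task:
print(run_cost("model-a", [(1200, 300)] * 10))  # ten sampled attempts
```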
They also conducted a case study on NovelQA, a benchmark for question answering over very long texts. Their findings show that benchmarks designed for model evaluation can be misleading when used for downstream evaluation. For example, the original NovelQA setup makes retrieval-augmented generation (RAG) look considerably worse than long-context models relative to how it performs in practice: their analysis finds that RAG and long-context models achieve roughly the same accuracy, while the long-context approach is about 20 times more expensive.
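A rough back-of-the-envelope calculation shows where a gap of that magnitude can come from: a long-context approach pays for the entire book on every question, while RAG pays only for the retrieved passages. The token counts and price below are illustrative assumptions, not figures from the study.

```python
# Back-of-the-envelope comparison (illustrative numbers only): feeding an entire
# novel into a long-context model vs. retrieving a handful of passages for RAG.
# Output tokens and retrieval/embedding overhead are ignored for simplicity.
PRICE_PER_M_INPUT_TOKENS = 3.00      # hypothetical input price, $/1M tokens

novel_tokens = 150_000               # whole book in the prompt, every question
rag_tokens = 8 * 1_000               # e.g., 8 retrieved chunks of ~1k tokens each

long_context_cost = novel_tokens * PRICE_PER_M_INPUT_TOKENS / 1_000_000
rag_cost = rag_tokens * PRICE_PER_M_INPUT_TOKENS / 1_000_000

print(f"long-context: ${long_context_cost:.3f}/question")   # ~$0.450
print(f"RAG:          ${rag_cost:.3f}/question")             # ~$0.024
print(f"ratio: ~{long_context_cost / rag_cost:.0f}x")        # ~19x
```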
The Overfitting Issue
Another serious concern is the tendency of AI models to exploit shortcuts in benchmarks, a form of overfitting. The Princeton study found that agent benchmarks, which typically contain only a small number of samples, are especially prone to this. The problem is more serious than data contamination in the training of foundation models, because knowledge about test samples can be programmed directly into the agent.
To counteract this, the researchers recommend that benchmark developers create and maintain holdout test sets of examples that cannot be memorized during training and can only be solved through a genuine understanding of the target task. In their analysis of 17 benchmarks, many lacked adequate holdout datasets, allowing agents to take shortcuts, even unintentionally.
“Surprisingly, we find that many agent benchmarks do not include held-out test sets,” the researchers noted. They advocate for benchmark developers to maintain the secrecy of test sets to prevent contamination and overfitting.
They also stress that different types of holdout samples are needed depending on the desired level of generality of the task.
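A benchmark maintainer might implement that guidance along the lines of the sketch below: hold out individual samples when an agent only needs to generalize across samples, and hold out whole tasks or domains when broader generality is claimed. The function names and structure are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch: building held-out splits that match the level of generality a
# benchmark is meant to test. Names and structure are illustrative only.
import random

def holdout_samples(samples: list[dict], holdout_fraction: float = 0.3,
                    seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Split individual samples into a public set and a secret held-out set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]  # (public, held out)

def holdout_by_key(samples: list[dict], key: str,
                   held_out_values: set) -> tuple[list[dict], list[dict]]:
    """Hold out entire tasks or domains (grouped by `key`) rather than samples,
    so agents cannot hard-code task-specific shortcuts."""
    public = [s for s in samples if s[key] not in held_out_values]
    hidden = [s for s in samples if s[key] in held_out_values]
    return public, hidden
```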
“Benchmark developers must strive to eliminate shortcuts,” the researchers emphasize. “This responsibility lies more with them than with agent developers, as creating benchmarks that discourage shortcuts is more straightforward than scrutinizing each agent for potential pitfalls.”
Analyzing WebArena, a benchmark that evaluates AI agents on several websites, the researchers identified several shortcuts in the training data that lead to overfitting. For instance, an agent could make assumptions about the structure of web addresses without accounting for the possibility that they might change in the future, which inflates accuracy estimates and fosters unrealistic optimism about agent capabilities.
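As a hypothetical illustration of that kind of shortcut (not code from WebArena or the study), an agent that hard-codes a site's URL scheme can score well on the benchmark while failing as soon as the site's routing changes:

```python
# Hypothetical illustration of a URL-structure shortcut (not from WebArena).
# The agent "navigates" by guessing the address instead of actually browsing.
def shortcut_open_product_page(base_url: str, product_id: int) -> str:
    # Assumes every product lives at /product/<id>. This scores well as long as
    # the benchmark's sites follow that pattern, but breaks if routing changes.
    return f"{base_url}/product/{product_id}"

# A more robust agent would locate the page by interacting with the site
# (search box, navigation links) rather than assuming the URL scheme.
print(shortcut_open_product_page("https://shop.example.com", 42))
```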
As the field of AI agents evolves, both research and development communities must refine methods for rigorously testing these systems, which are poised to play an essential role in everyday applications.
“AI agent benchmarking is in its infancy, and best practices have yet to be established, complicating the differentiation between true progress and mere hype,” the researchers conclude. “Our thesis is that agents differ sufficiently from models to necessitate a rethinking of benchmarking practices.”