A new artificial intelligence benchmark, GAIA, is designed to assess whether chatbots like ChatGPT can exhibit human-like reasoning and skills in everyday tasks.
Developed by a team from Meta, Hugging Face, AutoGPT, and GenAI, GAIA presents real-world questions that require fundamental abilities such as reasoning, handling multiple modalities, web browsing, and tool proficiency, according to the researchers’ paper published on arXiv.
The researchers assert that GAIA questions are “conceptually simple for humans yet challenging for most advanced AIs.” In their tests, human participants scored an impressive 92 percent, while GPT-4 with plugins managed only 15 percent.
"This notable performance disparity contrasts with the recent trend of large language models [LLMs] outperforming humans in specialized tasks such as law or chemistry,” the authors state.
GAIA Focuses on Human-like Competence, Not Expertise
Unlike traditional benchmarks that emphasize tasks difficult for humans, the researchers advocate focusing on tasks that reveal whether an AI can match the robustness of the average human. The GAIA team crafted 466 real-world questions with clear, unambiguous answers. The answers to 300 of them are withheld to power a public GAIA leaderboard, while the remaining 166 questions and answers are released as a development set.
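For readers who want to try the public portion of the benchmark, the sketch below shows one plausible way to load the development set with the Hugging Face datasets library. The repository id, config name, and column names used here are assumptions rather than confirmed details, and the dataset may be gated behind an access request; consult the official GAIA page for the exact identifiers.

```python
# Minimal sketch: loading GAIA's public development split with the
# Hugging Face `datasets` library. The repo id "gaia-benchmark/GAIA",
# the config "2023_all", and the column names below are assumptions;
# check the GAIA leaderboard page for the exact names.
from datasets import load_dataset

# The dev set (questions paired with answers) is assumed to be the
# "validation" split; the held-out leaderboard questions sit in "test".
dev_set = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

for example in dev_set.select(range(3)):
    print(example["Question"])      # natural-language task (assumed column name)
    print(example["Final answer"])  # ground-truth answer (assumed column name)
```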
"Solving GAIA would represent a milestone in AI research," says lead author Grégoire Mialon of Meta AI. "We believe that overcoming the challenges presented by GAIA is a key step toward the next generation of AI systems."
The Human vs. AI Performance Gap
Currently, the highest GAIA score is held by GPT-4 with manually selected plugins, which achieves 30 percent accuracy. The benchmark's creators suggest that an AI able to solve GAIA, answering such questions within a reasonable amount of time, could be considered an artificial general intelligence (AGI).
The paper critiques the trend of testing AIs with complex math, science, and law exams, with the authors noting that tasks which are challenging for humans are not necessarily difficult for modern systems.
GAIA emphasizes practical questions such as “Which city hosted the 2022 Eurovision Song Contest according to the official website?” and “How many images are listed in the latest 2022 Lego Wikipedia article?”
“We argue that the development of AGI depends on a system's ability to demonstrate similar robustness to the average human on such everyday questions,” the researchers wrote.
GAIA's Potential Impact on AI Development
The introduction of GAIA signals a significant shift in AI research, with potentially far-reaching effects. By emphasizing human-like competence in everyday tasks rather than specialized knowledge alone, GAIA pushes the boundaries of current AI benchmarks.
If future AI systems can demonstrate the common sense, adaptability, and reasoning that GAIA measures, that would suggest they are approaching practical AGI. Such progress could lead to more capable AI assistants, services, and products.
However, the researchers caution that today's chatbots still face considerable challenges in solving GAIA, reflecting existing limitations in reasoning, tool utilization, and managing diverse real-world scenarios.
As researchers tackle the GAIA challenge, their findings will illuminate progress toward creating more competent, versatile, and trustworthy AI systems. Moreover, benchmarks like GAIA encourage critical thinking about how AI can be shaped to prioritize human values such as empathy, creativity, and ethical decision-making.
For those interested, the GAIA benchmark leaderboard shows which next-generation LLMs currently perform best on the evaluation.