Meta AI researchers have unveiled OpenEQA, an open-source benchmark dataset designed to assess an artificial intelligence system's proficiency in "embodied question answering" — the ability to understand a real-world space well enough to answer natural-language questions about it.
Positioned as a pivotal resource for the emerging field of "embodied AI," the OpenEQA dataset comprises over 1,600 questions pertaining to more than 180 real-world environments, such as homes and offices. These questions are categorized into seven distinct types to rigorously evaluate an AI's skills in object and attribute recognition, spatial reasoning, functional reasoning, and commonsense knowledge.
"Embodied Question Answering (EQA) serves as both a meaningful application and a framework for assessing an AI agent’s understanding of the world," the researchers noted in their publication. "EQA entails comprehending an environment sufficiently to answer questions about it in natural language."
Notably, even advanced models like GPT-4V have faced challenges in matching human performance on OpenEQA, reflecting the benchmark's rigor in evaluating an AI's ability to comprehend and respond to real-world questions.
Uniting diverse fields of AI
The OpenEQA initiative bridges several cutting-edge domains in artificial intelligence, including computer vision, natural language processing, knowledge representation, and robotics. The ultimate goal is to create artificial agents capable of perceiving and interacting with their surroundings, engaging in natural conversations with humans, and leveraging knowledge to enhance daily life.
Researchers envision two primary applications for this "embodied intelligence." First, AI assistants integrated into augmented reality glasses or headsets could leverage video and sensor data to provide users with a photographic memory, answering questions like, “Where did I leave my keys?” Second, mobile robots could autonomously navigate environments to gather information, such as determining, “Do I have any coffee left?”
Establishing a rigorous evaluation standard
In developing the OpenEQA dataset, Meta researchers began by collecting video footage and 3D scans of real-world settings. They then invited individuals to formulate questions they would pose to an AI assistant with access to that visual data.
The dataset includes 1,636 questions that assess a broad range of perception and reasoning skills. For instance, answering "How many chairs are around the dining table?" requires the AI to identify objects, comprehend the spatial term "around," and count the relevant items. Other inquiries necessitate a fundamental understanding of object uses and attributes.
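To make that structure concrete, here is a minimal sketch of how one such benchmark entry might be represented in code. The field names, category label, and example answers below are illustrative assumptions, not the official OpenEQA schema.

```python
from dataclasses import dataclass, field

@dataclass
class EQAItem:
    """One OpenEQA-style benchmark entry (illustrative fields, not the official schema)."""
    question: str          # natural-language question about the environment
    episode_id: str        # identifier of the video / 3D scan the question refers to
    category: str          # one of the seven question types, e.g. spatial reasoning
    human_answers: list[str] = field(default_factory=list)  # multiple acceptable answers

# Hypothetical entry matching the counting example above
item = EQAItem(
    question="How many chairs are around the dining table?",
    episode_id="home-042",
    category="object recognition / counting",
    human_answers=["four", "4", "there are four chairs"],
)
```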
Because many questions admit more than one valid response, each question comes with multiple human-generated answers. To evaluate AI performance at scale, the researchers used large language models to automatically gauge how closely an AI-generated answer matches those human responses.
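A rough sketch of what such LLM-based scoring can look like in practice is shown below. The prompt wording, the 1–5 scale, and the `ask_llm` callable are assumptions for illustration only, not the exact protocol described in the paper.

```python
def score_answer(question: str, human_answers: list[str], model_answer: str, ask_llm) -> int:
    """Ask a judge LLM to rate a model's answer against human references.

    `ask_llm` is a placeholder for whatever chat-completion call is available;
    the prompt and 1-5 scale are an illustrative sketch of LLM-based grading.
    """
    prompt = (
        "You are grading answers to a question about a household environment.\n"
        f"Question: {question}\n"
        f"Reference answers from humans: {'; '.join(human_answers)}\n"
        f"Candidate answer: {model_answer}\n"
        "On a scale of 1 (completely wrong) to 5 (matches the references), "
        "reply with a single integer."
    )
    reply = ask_llm(prompt)       # e.g. a call to any hosted or local LLM
    return int(reply.strip())     # naive parse; real code would validate the output
```

Averaging such scores across all 1,636 questions gives a single number that can be compared directly against the human baseline, which is how gaps like the one observed for GPT-4V become visible.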