Benchmarking in AI Development
Benchmarking is an essential step in developing artificial intelligence: it gives a clear measure of a model's capabilities and lets researchers gauge its performance on specific tasks. Traditional benchmarking has its limitations, however. Once an algorithm masters a static dataset, researchers must invest significant time creating new benchmarks to push the AI further, and as AI technology evolves, the demand for fresh benchmarks keeps growing. For instance, it took the research community approximately 18 years to reach human-level performance on the MNIST dataset and roughly six years to surpass humans on ImageNet. In contrast, only one year was needed to exceed human performance on the GLUE benchmark for language understanding.
Furthermore, existing benchmarks can harbor biases and statistical artifacts that algorithms learn to exploit, leading to inflated evaluations. For example, a model may notice that questions beginning with "how much" or "how many" are most often answered with "2" and simply respond "2" every time, scoring well without genuinely understanding the question.
In response to these challenges, Facebook AI Research (FAIR) has introduced a novel approach to benchmarking that puts humans directly in the loop for evaluating and training its natural language processing (NLP) models. The initiative, named Dynabench (short for "dynamic benchmarking"), has humans interact with NLP algorithms by posing probing, linguistically challenging questions designed to test the models' capabilities and expose their weaknesses. The fewer times the algorithm is fooled, the better its performance.
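To make the idea concrete, here is a minimal sketch of what one round of human-in-the-loop, adversarial benchmarking could look like. Everything in it (the function names, the toy sentiment "model," and the example probes) is an illustrative assumption, not Dynabench's actual API; it only shows the core loop of counting how often human-written examples fool a model and keeping those examples for the next, harder round.

```python
# Hypothetical sketch of one dynamic-benchmarking round, in the spirit of
# Dynabench. Names and data structures are illustrative assumptions only.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RoundResult:
    total_attempts: int = 0
    times_fooled: int = 0
    fooling_examples: List[Tuple[str, str]] = field(default_factory=list)

    @property
    def fooled_rate(self) -> float:
        # Lower is better: the fewer times the model is fooled, the stronger it is.
        return self.times_fooled / self.total_attempts if self.total_attempts else 0.0


def run_round(model: Callable[[str], str],
              human_examples: List[Tuple[str, str]]) -> RoundResult:
    """Each (text, expected_label) pair is a human-written probe intended to
    trip the model up. Probes that fool the model are collected so they can
    seed the next, harder benchmark round."""
    result = RoundResult()
    for text, expected_label in human_examples:
        result.total_attempts += 1
        if model(text) != expected_label:
            result.times_fooled += 1
            result.fooling_examples.append((text, expected_label))
    return result


if __name__ == "__main__":
    # Toy sentiment "model" that relies on a shortcut: the word "great".
    naive_model = lambda text: "positive" if "great" in text.lower() else "negative"

    # Linguistically tricky probes a human annotator might write.
    probes = [
        ("The plot was great, said no one ever.", "negative"),
        ("Not a great film by any stretch.", "negative"),
        ("I expected it to be terrible, but it was wonderful.", "positive"),
    ]

    outcome = run_round(naive_model, probes)
    print(f"Fooled on {outcome.times_fooled}/{outcome.total_attempts} probes "
          f"({outcome.fooled_rate:.0%})")
```

Because the probes deliberately target the model's shortcut, the toy model is fooled every time; in a dynamic benchmark, those fooling examples would feed back into training, so each round of human probing becomes harder to game.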
Compared to static benchmarks, this dynamic system minimizes issues like saturation and bias, enabling more accurate measurements that reflect real-world applications. According to FAIR researcher Douwe Kiela, “The process cannot saturate, it will be less prone to bias and artifacts, and it allows us to measure performance in ways that are closer to the real-world applications we care most about.”
One significant advantage of Dynabench is its accessibility: anyone with basic English proficiency can participate by logging into the Dynabench portal and engaging with a range of NLP models. Looking ahead, Kiela and his team aim to expand the system by integrating more models, modalities, and languages.