On June 19, the Shanghai Artificial Intelligence Laboratory released the results of the first comprehensive evaluation of AI performance on China's national college entrance examination, known as the Gaokao. Following the conclusion of the 2024 Gaokao, the laboratory used its OpenCompass assessment system to test six open-source models, along with GPT-4o, on the Chinese, mathematics, and English sections of the exam. All models had been released before the exam took place, so the questions could not have appeared in their training data, making this a genuinely closed-book evaluation; experienced teachers scored the answers manually to align with realistic grading standards.
The results showed that Qwen2-72B, GPT-4o, and InternLM2-20B-WQX ranked as the top three models, each scoring above 70% of the total. While most models performed well on the Chinese and English sections, they struggled significantly in mathematics. Notably, InternLM2-20B-WQX achieved the highest mathematics score, surpassing GPT-4o.
To ensure fairness, commercial closed-source models other than GPT-4o were excluded from the evaluation. The participating models included several recent open-source releases, such as Mixtral 8x22B, Yi-1.5-34B, and GLM-4-9B, all launched between April and June 2024.
The total possible score across the Chinese, mathematics, and English sections was 420 points. According to the results, Qwen2-72B secured the top position with 303 points (about 72% of the total), followed by GPT-4o with 296 points (roughly 70%), and InternLM2-20B-WQX in third. However, the average mathematics score was notably low, at only 36% of that section's marks, indicating that none of the participating models reached a passing standard in the subject.
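For illustration, the percentage figures above follow directly from the point totals reported here. The short Python sketch below shows the arithmetic; the 420-point maximum and the two totals are taken from the article, while the variable names and structure are purely for demonstration (the third-place total is omitted because the article does not report it).

MAX_SCORE = 420  # Chinese + mathematics + English, as reported

# Point totals reported in the article
reported_totals = {
    "Qwen2-72B": 303,
    "GPT-4o": 296,
    # InternLM2-20B-WQX placed third, but its total is not given
}

for model, points in reported_totals.items():
    pct = points / MAX_SCORE * 100
    print(f"{model}: {points}/{MAX_SCORE} = {pct:.1f}%")

# Output:
# Qwen2-72B: 303/420 = 72.1%
# GPT-4o: 296/420 = 70.5%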
Teachers offered several observations about overall model performance. They noted that while the models' comprehension of modern-language reading passages in the Chinese section was generally strong, their understanding of classical Chinese texts varied considerably. In addition, the essays generated by the models tended to read like structured question-and-answer responses, lacking the sophistication of typical human essays.
In mathematics, the models' responses were often disorganized and sometimes incorrect, with some producing the right answer despite flawed reasoning. Although the models showed strong recall of formulas, they struggled to apply them flexibly to unfamiliar problems. English performance was satisfactory overall, though some models had difficulty with certain question types, apparently because they were not well adapted to those formats. Interestingly, many AI-generated English essays exceeded the word limit, in contrast to human candidates, who frequently lose points for writing too little.
This evaluation represents a critical advancement in understanding AI capabilities within educational contexts and sets the stage for future enhancements in model development.