On July 18, the Shanghai Artificial Intelligence Laboratory released the results of a comprehensive evaluation of seven large AI models, conducted with its open-source assessment system, SiNan. The evaluation tested the models on all subjects of the national college entrance examination (Gaokao). The Shusheng Puyu 2.0 Wenchuxing model, Alibaba's Tongyi Qwen 2-72B, and GPT-4o stood out, taking the top three spots in both the humanities and the sciences.
Measured against this year's admission cutoffs in Henan Province, whose score lines were the highest in the country, all three models cleared the "first-tier" and "second-tier" thresholds. Tongyi Qwen 2-72B was the top performer in the humanities with 546 points, while the Puyu Wenchuxing model led the sciences with 468.5. In humanities subjects such as Chinese, history, geography, and political science, the models demonstrated solid knowledge and comprehension, scoring above the "first-tier" line. In the science subjects, however, they generally lagged behind, with mathematical reasoning standing out as the main area needing improvement.
This assessment featured several key characteristics:
1. Comprehensive Exam: The evaluation encompassed full exam papers, including questions with visual components, rather than focusing solely on specific question types.
2. Pre-Exam Open Source: The models tested had all been publicly released before the exam took place, reducing the chance that exam questions had leaked into their training data.
3. Expert Grading: Educators with experience grading actual Gaokao papers scored the responses, keeping the results consistent with conventional human scoring standards.
4. Total Transparency: All generated answers, code, and evaluation results were made fully accessible to the public (a minimal sketch of this collect-and-release workflow follows the list).
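For readers who want a concrete picture of what such a pipeline involves, the sketch below mirrors the workflow described above: full exam papers (text plus any accompanying figures) are fed to each model, and every raw answer is written out unscored so that human graders, and later the public, can review it. This is only an illustration; the class, function, and file names are hypothetical and are not taken from SiNan's actual codebase.

```python
# Illustrative only: a minimal harness mirroring the described workflow.
# All names (ExamQuestion, query_model, the output path) are hypothetical.
import json
from dataclasses import dataclass, asdict

@dataclass
class ExamQuestion:
    subject: str                    # e.g. "math", "history"
    prompt: str                     # full question text, as printed on the paper
    image_path: str | None = None   # diagram or figure, if the question has one

def query_model(model_name: str, question: ExamQuestion) -> str:
    """Placeholder for a real model call (local weights or an API)."""
    raise NotImplementedError("plug in the model of your choice here")

def run_exam(model_names: list[str], questions: list[ExamQuestion], out_file: str) -> None:
    records = []
    for model in model_names:
        for q in questions:
            answer = query_model(model, q)
            # Keep the raw, ungraded answer; scoring is done later by human experts.
            records.append({"model": model, **asdict(q), "answer": answer})
    # Dump everything to a single file so graders and the public see identical data.
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```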
Despite the strong overall performance of these large models, the grading teachers noted significant gaps compared with human students. On subjective questions, the models often failed to fully grasp the prompt, producing answers that missed the point. In mathematics, their solution processes were mechanical and lacked logical coherence, and in geometry problems their spatial reasoning frequently broke down. The models also showed only a shallow understanding of physics and chemistry experiments, often misapplying experimental apparatus, and they tended to fabricate plausible-sounding but entirely fictional content, which confused the graders.
Through this analysis of the AI “candidates,” the SiNan evaluation team identified several common issues with current large models: a lack of reflective reasoning, a tendency to produce fabricated content assertively, inadequate spatial imagination, and superficial understanding of physical and chemical processes.