AI Startup Outshines Google in Stanford's Latest Model Rankings

In a surprising twist, a foundation model from the startup Writer has surpassed Google in the latest performance rankings conducted by researchers at Stanford University. The Palmyra X V3 model, with 72 billion parameters, emerged as the highest-scoring non-OpenAI model on the Stanford leaderboard for the Holistic Evaluation of Language Models (HELM) Lite. Despite having fewer parameters than several rivals, Palmyra secured third place overall, while Google's PaLM 2 claimed the fourth spot.

Another standout was Yi-34B, developed by the Chinese startup 01.ai under the leadership of Kai-Fu Lee. This open-source, 34 billion-parameter model, trained on three trillion tokens, outperformed models such as Mistral 7B, Anthropic's Claude 2, and Meta's Llama 2 to earn a place on Stanford's leaderboard.

As expected, OpenAI's GPT-4 maintained its position at the top of the Stanford rankings with a significant lead. Released in March 2023, GPT-4 excelled on multiple benchmarks, including OpenBookQA for elementary science questions, MMLU for exam-style questions across dozens of subjects, and LegalBench, which tests models on legal reasoning tasks. OpenAI's GPT-4 Turbo followed, securing second place. Unveiled at DevDay 2023, GPT-4 Turbo was designed for operational efficiency and can process 16 times more text than its predecessor. However, it fell short of GPT-4's performance due to difficulties in adhering to provided instructions.

Percy Liang, an associate professor at Stanford, remarked on the unexpected results, highlighting how smaller models have recently outshined larger ones. “Some recent models are very chatty; they sometimes provide the correct answer in the wrong format, even when instructed otherwise,” he noted.

The HELM Lite framework was intentionally designed as a lightweight yet broad evaluation. Building on Stanford's earlier, more expansive HELM framework, the latest test focuses specifically on model capabilities rather than safety; the research team plans to introduce a separate safety-focused benchmark developed in collaboration with MLCommons.

HELM Lite evaluates various competencies, including machine translation, medical question answering, and reading comprehension. The project drew inspiration from the Open LLM leaderboard on Hugging Face, where Yi-34B currently ranks first. Notably, the Stanford team did not have access to the internals of closed models like GPT-4 and Claude. Instead, they queried the models' standard interfaces and carefully crafted prompts to elicit outputs in the desired format.
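To illustrate the kind of prompt engineering described above, here is a minimal, hypothetical sketch (not HELM's actual code; the function names and formats are assumptions) of a few-shot multiple-choice prompt that pins down the answer format, along with a lenient parser for the "chatty" responses Liang mentions, which may bury the correct letter in prose:

```python
import re

def build_prompt(question, choices, examples):
    """Assemble a few-shot multiple-choice prompt that fixes the answer format.

    `examples` is a list of (question, choices, answer_letter) tuples used
    as in-context demonstrations of the expected one-letter answer.
    """
    lines = ["Answer with a single letter (A, B, C, or D) and nothing else.", ""]
    for ex_question, ex_choices, ex_answer in examples:
        lines.append(f"Question: {ex_question}")
        for letter, choice in zip("ABCD", ex_choices):
            lines.append(f"{letter}. {choice}")
        lines.append(f"Answer: {ex_answer}")
        lines.append("")
    lines.append(f"Question: {question}")
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def parse_answer(response):
    """Extract a standalone answer letter, even from a verbose reply."""
    match = re.search(r"\b([ABCD])\b", response.strip())
    return match.group(1) if match else None
```

For example, `parse_answer("Sure! The answer is B, because ...")` recovers `"B"` even though the model ignored the format instruction; a strict exact-match scorer would have marked the same reply wrong, which is the failure mode Liang describes.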
