"Questions Arise Over Performance of New Open Source AI Leader Reflection 70B, Accused of ‘Fraud’"

In the span of a single weekend, the newest contender among open-source AI models came under significant scrutiny, casting doubt on its headline claims and its reputation.

Reflection 70B, a variant of Meta’s Llama 3.1 large language model released by the New York startup HyperWrite (formerly OthersideAI), was hailed for achieving impressive benchmark scores. However, subsequent evaluations by independent testers raised questions about the validity of those claims.

On September 6, 2024, HyperWrite co-founder Matt Shumer proclaimed Reflection 70B as "the world's top open-source model" in a post on the social network X. Shumer detailed the model's use of "Reflection Tuning," a technique that enables LLMs to verify the accuracy of their outputs before presenting them to users, enhancing performance in various domains.
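Shumer’s announcement did not include implementation details, but the general idea behind reflection-style generation can be sketched as a multi-pass loop: the model drafts an answer, critiques its own draft, and revises before responding. The snippet below is a minimal, hypothetical illustration of that pattern only; the `generate` function is a placeholder for whatever LLM completion API is used, and the announced "Reflection Tuning" technique reportedly trains this behavior into the model itself rather than orchestrating it with separate calls.

```python
# Minimal sketch of a reflection-style generation loop.
# This is an illustrative assumption about the general idea, not
# HyperWrite's actual training or inference code.

def generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's completion API."""
    raise NotImplementedError

def answer_with_reflection(question: str) -> str:
    # Pass 1: produce a draft answer.
    draft = generate(f"Answer the question:\n{question}")

    # Pass 2: ask the model to check its own draft for mistakes.
    critique = generate(
        "Review the following answer for factual or logical errors and "
        f"list any problems you find.\n\nQuestion: {question}\nAnswer: {draft}"
    )

    # Pass 3: revise the draft using the critique; only the corrected
    # answer is shown to the user.
    final = generate(
        "Rewrite the answer so it fixes the problems listed in the critique."
        f"\n\nQuestion: {question}\nDraft: {draft}\nCritique: {critique}"
    )
    return final
```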

However, by September 7, an organization called Artificial Analysis publicly challenged this assertion. Their evaluation found that Reflection 70B scored the same on MMLU as Llama 3 70B and fell notably short of Meta’s Llama 3.1 70B, a stark contrast with the results HyperWrite had initially published.

Shumer later said that the model’s weights had been corrupted during the upload to Hugging Face, which could explain the gap between its public performance and HyperWrite’s internal tests.

On September 8, after testing a private API version of the model, Artificial Analysis acknowledged seeing impressive, though unverifiable, performance that still did not match HyperWrite’s original claims. They also raised pointed questions about why an untested version of the model had been publicly released and why the weights behind the private API version had not been published.

Community members across AI-focused Reddit threads also voiced skepticism regarding Reflection 70B’s performance and origins. Some claimed it appeared to be a variant of Llama 3 rather than the anticipated Llama 3.1, raising further doubts about its legitimacy. One user even accused Shumer of perpetrating "fraud in the AI research community."

Despite the backlash, some users defended Reflection 70B, citing strong performance in their use cases. However, the rapid transition from excitement to criticism highlights the volatile nature of the AI landscape.

For 48 hours, the AI research community awaited updates from Shumer on the model’s performance and corrected weights. On September 10, he finally addressed the controversy, saying:

"I got ahead of myself with this announcement, and I apologize. We made decisions based on the information we had. I know many are excited about this potential yet skeptical. A team is working diligently to ascertain what occurred. Once we've clarified the facts, we'll maintain transparency with the community."

Shumer referenced a post from Sahil Chaudhary, founder of Glaive AI, who acknowledged the confusion around the model’s claims and the difficulty of reproducing its benchmark scores.

Chaudhary stated:

“I want to address the valid criticisms. I’m investigating the situation and will provide a transparent summary soon. At no point did I run models from other providers, and I aim to explain the discrepancies, including unexpected behaviors like skipping certain terms. I have much to uncover regarding the benchmarks and I appreciate the community's patience as I rebuild trust.”

The situation remains unresolved, with continued skepticism surrounding both Reflection 70B and its claims within the open-source AI community.
