Update — Monday, Sept. 9, 2024, at 11:02 am ET: Third-party evaluations have not been able to replicate the performance metrics previously shared by Matt Shumer, co-founder and CEO of AI writing startup HyperWrite, regarding Reflection 70B. As a result, Shumer faces accusations of fraud on X.
A new contender has emerged in the AI landscape: Matt Shumer has announced Reflection 70B, a large language model (LLM) based on Meta’s open-source Llama 3.1-70B Instruct. This model incorporates an innovative error self-correction technique, showcasing impressive performance in third-party benchmarks.
In a post on X, Shumer declared Reflection 70B “the world’s top open-source AI model.” He shared a benchmark performance chart, highlighting the model's superior results.
Rigorous Testing and Performance
Reflection 70B has undergone extensive testing using benchmarks like MMLU and HumanEval, with LMSys’s LLM Decontaminator ensuring results are contamination-free. The findings demonstrate that Reflection consistently outperforms models from Meta’s Llama series and competes closely with leading commercial models.
Users can experience this model firsthand on the demo site. However, Shumer noted that the announcement has generated significant traffic, and his team is swiftly sourcing additional GPUs to accommodate demand.
Unique Capabilities of Reflection 70B
Shumer emphasized that Reflection 70B offers distinct advantages, particularly in error identification and correction. He explained, “LLMs often hallucinate without the ability to course-correct. What if an LLM could learn to recognize and correct its own mistakes?”
This insight led to the name “Reflection,” as the model can assess its outputs for accuracy before presenting them to the user. Its edge lies in "reflection tuning," a technique that allows it to identify shortcomings in its reasoning and rectify them before finalizing a response.
Reflection 70B introduces special tokens for structured reasoning and error correction, enabling seamless user interaction. During inference, the model provides reasoning outputs within designated tags, allowing for real-time corrections when it identifies errors.
The playground demo includes suggested prompts, such as counting the letter “r” in “Strawberry” and determining which number is larger, 9.11 or 9.9—tasks many AI models, including well-known proprietary ones, often miscalculate. In our tests, Reflection 70B eventually provided the correct answer after a brief delay.
This functionality makes the model particularly valuable for tasks requiring high accuracy, as it separates reasoning into distinct steps for enhanced precision. Reflection 70B is available for download via Hugging Face, with API access expected later today through Hyperbolic Labs.
Anticipation for Reflection 405B
The release of Reflection 70B is just the beginning. Shumer announced that an even larger model, Reflection 405B, will debut next week. He mentioned ongoing efforts to integrate Reflection 70B into HyperWrite’s primary AI writing assistant product, stating, “I’ll share more on this soon.”
Reflection 405B aims to surpass even the top closed-source models currently available. Shumer also indicated that a detailed report on the training process and benchmarks will be published, offering insights into the innovations behind the Reflection series.
Built on Meta’s Llama 3.1 70B Instruct, Reflection 70B maintains compatibility with existing tools and pipelines through the Llama chat format.
Contribution of Synthetic Data by Glaive
A vital factor in the success of Reflection 70B is the synthetic data generated by Glaive, a startup focused on creating use-case-specific datasets. Glaive’s platform enables the rapid training of small, targeted language models, addressing a significant bottleneck in AI development: the availability of high-quality, task-specific data.
By producing synthetic datasets tailored to specific needs, Glaive allows companies to fine-tune models efficiently and economically. The company has previously delivered success with smaller models, such as a 3B parameter model that outperformed larger open-source counterparts in tasks like HumanEval. Spark Capital has backed Glaive with a $3.5 million seed investment, supporting its vision of a democratized AI ecosystem.
Leveraging Glaive’s technology, the Reflection team generated high-quality synthetic data, dramatically accelerating development. According to Shumer, the training process took three weeks, involving five iterations of the model, with a custom dataset built using Glaive’s systems.
Background of HyperWrite
While it may seem that Reflection 70B appeared suddenly, Shumer has been entrenched in the AI sector for years. He co-founded what was initially called Otherside AI in 2020 with Jason Kuperberg in Melville, New York. The company gained traction with HyperWrite, its flagship product, which evolved from a Chrome extension for crafting emails into a comprehensive AI writing assistant capable of drafting essays and organizing emails. As of November 2023, HyperWrite boasted two million users, earning its founders a spot on Forbes' “30 Under 30” List.
In March 2023, HyperWrite secured $2.8 million from investors, including Madrona Venture Group, enabling the introduction of innovative AI-driven features that transform web browsers into virtual assistants handling various tasks.
Shumer emphasizes that accuracy and safety remain paramount for HyperWrite, especially as it delves into complex automation. The platform continually refines its personal assistant tool, reflecting the same care for precision and responsibility found in Reflection 70B.
Future Prospects for HyperWrite and Reflection Models
Looking forward, Shumer plans even bigger advancements for the Reflection series. With the imminent launch of Reflection 405B, he believes it will significantly outstrip the performance of proprietary models like OpenAI’s GPT-4o.
This poses challenges not only for OpenAI, which is reportedly seeking substantial new investments from major players like Nvidia and Apple, but also for other closed-source model providers such as Anthropic and Microsoft.
As the generative AI landscape evolves, the balance of power is shifting once again. The debut of Reflection 70B marks a pivotal moment for open-source AI, granting developers and researchers access to a powerful tool that rivals proprietary models. With its innovative approach to reasoning and error correction, Reflection may establish a new benchmark for the capabilities of open-source models.