There is no single speedometer for measuring the performance of generative AI models, but a key metric is throughput: the number of tokens a model can generate per second.
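The metric itself is simple to compute: time a generation run and divide the number of output tokens by the elapsed seconds. Below is a minimal Python sketch of that measurement; the fake_stream generator is a hypothetical stand-in for whatever streaming API a given provider exposes.

    import time

    def tokens_per_second(stream):
        """Count tokens from any iterable of tokens, divide by wall time."""
        start = time.perf_counter()
        count = sum(1 for _ in stream)
        return count / (time.perf_counter() - start)

    def fake_stream(n=1000, delay=0.001):
        """Hypothetical stand-in for a real streaming endpoint:
        yields n tokens roughly 1 ms apart."""
        for _ in range(n):
            time.sleep(delay)
            yield "tok"

    print(f"{tokens_per_second(fake_stream()):.0f} tokens/sec")  # ~1,000 on this stand-in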
Today, SambaNova Systems announced a significant achievement in generative AI performance, reaching 1,000 tokens per second with the 8-billion-parameter Llama 3 instruct model. Previously, the fastest benchmark for Llama 3 was held by Groq at 800 tokens per second. The new milestone was independently verified by the testing firm Artificial Analysis. The increased processing speed has important implications for enterprises, potentially resulting in quicker response times, better hardware utilization, and lower operational costs.
A Race for AI Performance
“We are witnessing an acceleration in the AI chip race beyond expectations. We were excited to validate SambaNova’s claims with independent benchmarks focused on real-world performance,” said George Cameron, Co-Founder of Artificial Analysis. “AI developers now have a broader range of hardware options, which is especially beneficial for speed-dependent applications like AI agents and consumer AI solutions that require minimal response times and efficient document processing.”
How SambaNova Accelerates Llama 3 and Generative AI
SambaNova focuses on building enterprise-grade generative AI solutions that span both hardware and software.
On the hardware side, the company has designed a custom AI chip known as the Reconfigurable Dataflow Unit (RDU). Like Nvidia's AI accelerators, RDUs handle both training and inference, with a particular focus on enterprise workloads and model fine-tuning. The latest generation of the chip, the SN40L, was unveiled in September 2023.
SambaNova also offers a proprietary software stack that includes the Samba-1 model, launched on February 28. The 1-trillion-parameter model is referred to as Samba-CoE (Composition of Experts), allowing enterprises to use multiple expert models individually or in combination, tailored to their own data.
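To make the composition-of-experts idea concrete, here is a toy Python router; the expert names and the keyword heuristic are invented for illustration and are not a description of how Samba-CoE actually dispatches requests.

    # Toy composition-of-experts router. A production router would use a
    # learned classifier rather than keyword matching.
    EXPERTS = {
        "code": "code-expert-model",       # hypothetical expert names
        "legal": "legal-expert-model",
        "general": "general-expert-model",
    }

    def route(prompt: str) -> str:
        """Pick one specialist model per request."""
        text = prompt.lower()
        if "def " in text or "function" in text:
            return EXPERTS["code"]
        if "contract" in text or "clause" in text:
            return EXPERTS["legal"]
        return EXPERTS["general"]

    print(route("Summarize this contract clause"))  # legal-expert-model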
For the 1,000 tokens per second result, SambaNova used its Samba-1 Turbo model, an API version made available for testing, and the company plans to bring the same speed enhancements to its main enterprise model soon. However, Cameron noted that Groq's 800 tokens per second measurement refers to its public API endpoint, whereas SambaNova's results come from a dedicated private endpoint, making direct comparisons less straightforward.
“Nevertheless, this speed exceeds 8X the median output of other API providers we benchmarked and is several times faster than typical output rates on Nvidia H100s,” Cameron stated.
Reconfigurable Dataflow for Enhanced Performance
SambaNova’s performance is driven by its reconfigurable dataflow architecture, central to its RDU technology. This architecture allows for optimized resource allocation across neural network layers and kernels through compiler mapping.
“With dataflow, we can continually refine the model mappings since it’s fully reconfigurable,” said Rodrigo Liang, CEO and Founder of SambaNova. “This leads to not just incremental gains, but considerable improvements in efficiency and performance as software evolves.”
When Llama 3 was first released, Liang's team achieved 330 tokens per second on Samba-1. Through successive optimizations over the following months, that figure has roughly tripled to 1,000 tokens per second. Liang explained that optimization involves balancing resource allocation among kernels to prevent bottlenecks and maximize overall throughput across the neural network pipeline. He likened this to the approach SambaNova's software stack takes to help enterprises with their fine-tuning efforts.
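A toy calculation shows why that balancing matters: in a pipelined design, end-to-end throughput is capped by the slowest stage, so shifting compute from over-provisioned kernels to the bottleneck lifts the whole pipeline. The unit counts and per-unit speeds below are made-up numbers, not SambaNova's actual kernel profile.

    # Pipeline throughput is the minimum across stages; rebalancing the
    # same 12 compute units raises that minimum. (Illustrative numbers.)
    def pipeline_throughput(units, tokens_per_unit):
        return min(u * t for u, t in zip(units, tokens_per_unit))

    speeds = [100, 40, 80]  # tokens/sec each unit contributes, per stage

    print(pipeline_throughput([4, 4, 4], speeds))  # 160: stage 2 bottlenecks
    print(pipeline_throughput([2, 7, 3], speeds))  # 200: same units, rebalanced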
Enterprise Quality and Higher Speed
Liang emphasized that SambaNova achieves this speed milestone using 16-bit precision, a standard that ensures the quality enterprises require.
He stated, "We’ve consistently utilized 16-bit precision for our customers, as they prioritize quality and minimizing hallucinations in outputs."
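For a sense of what 16-bit precision implies in practice, the back-of-the-envelope arithmetic below compares weight memory for an 8-billion-parameter model across common numeric formats; this is generic arithmetic, not a description of the SN40L's memory layout.

    # Weight memory for an 8B-parameter model at common precisions.
    params = 8e9
    for fmt, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
        print(f"{fmt}: {params * bytes_per_param / 1e9:.0f} GB")
    # FP32: 32 GB, FP16/BF16: 16 GB, INT8: 8 GB. 16-bit halves memory
    # versus FP32 while avoiding the quality loss that 8-bit
    # quantization can introduce.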
The importance of speed for enterprise users is growing as organizations increasingly adopt AI agent-driven workflows. Moreover, faster generation times offer economic advantages.
“The quicker we can generate responses, the more available resources we free up for others to use,” he noted. “Ultimately, this leads to a more compact infrastructure and cost savings.”
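That economic argument follows directly from throughput arithmetic: at a fixed hardware cost per hour, tripling tokens per second cuts the cost of each generated token by the same factor. The dollar figure below is hypothetical.

    # Cost per million output tokens at a hypothetical $10/hour of hardware.
    cost_per_hour = 10.0
    for tps in (330, 1000):
        cost_per_m = cost_per_hour / (tps * 3600) * 1e6
        print(f"{tps} tok/s -> ${cost_per_m:.2f} per 1M tokens")
    # 330 tok/s -> $8.42 per 1M tokens
    # 1000 tok/s -> $2.78 per 1M tokens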