Nvidia is not alone in the AI accelerator landscape; Intel is making significant strides with its Gaudi 2 technology, as highlighted in new research by Databricks.
The research reveals that Intel Gaudi 2 competes robustly against Nvidia's leading AI accelerators. For large language model (LLM) inference, Gaudi 2 matches the latency of Nvidia H100 systems on decoding and surpasses the performance of the Nvidia A100. Additionally, Gaudi 2 achieves higher memory bandwidth utilization than both the H100 and A100.
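One way to read the bandwidth-utilization claim is as a ratio of achieved memory traffic to peak memory bandwidth. The sketch below illustrates the common approximation for LLM decoding, in which each generated token requires reading the full model weights (plus KV cache) from device memory once. The model size, decoding throughput, KV-cache footprint, and the 2.45 TB/s peak-bandwidth figure are illustrative assumptions, not measurements from the report.

```python
# A minimal sketch of memory-bandwidth utilization (MBU) for LLM decoding.
# Approximation: each decoded token reads all model weights (plus the KV
# cache) from device memory once, so achieved bandwidth is roughly
# bytes-read-per-token * tokens-per-second. All numbers are illustrative.

def mbu(param_bytes: float, kv_cache_bytes: float,
        tokens_per_sec: float, peak_bw: float) -> float:
    """Achieved memory traffic per second divided by peak bandwidth."""
    return (param_bytes + kv_cache_bytes) * tokens_per_sec / peak_bw

# Hypothetical: a 70B-parameter model with 16-bit weights (2 bytes/param),
# decoding at 12 tokens/s on a chip with 2.45 TB/s peak HBM bandwidth.
weights = 70e9 * 2    # ~140 GB of weights
kv_cache = 10e9       # assumed KV-cache working set, illustrative
print(f"MBU: {mbu(weights, kv_cache, 12.0, 2.45e12):.1%}")  # ~73.5%
```

Higher MBU during decoding means the chip is spending less time stalled on memory, which is why the metric matters for inference-heavy workloads.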
While Nvidia's top-tier accelerators still lead on training performance, Databricks found that Gaudi 2 delivers the second-fastest single-node LLM training performance after the Nvidia H100, at more than 260 TFLOPS per chip. Notably, based on public cloud pricing, Gaudi 2 offers the best performance per dollar for both training and inference compared with the A100 and H100.
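To make the performance-per-dollar comparison concrete, the sketch below divides sustained node-level training throughput by an hourly instance price. Only the ~260 TFLOPS Gaudi 2 figure comes from the report; the H100 throughput and both hourly prices are placeholder assumptions, not actual cloud rates.

```python
# A minimal sketch of a performance-per-dollar comparison for training.
# The Gaudi 2 throughput (~260 TFLOPS/chip) is from the report; the H100
# throughput and both hourly prices are placeholders for illustration.

def tflops_per_dollar(tflops_per_chip: float, chips: int,
                      dollars_per_hour: float) -> float:
    """Node-level sustained TFLOPS divided by the node's hourly price."""
    return tflops_per_chip * chips / dollars_per_hour

# Hypothetical 8-chip nodes with assumed on-demand prices.
nodes = {
    "Gaudi 2 (8x)": (260.0, 8, 40.0),  # price assumed for illustration
    "H100 (8x)":    (400.0, 8, 90.0),  # throughput and price assumed
}
for name, (tflops, chips, price) in nodes.items():
    print(f"{name}: {tflops_per_dollar(tflops, chips, price):.0f} TFLOPS per $/hr")
```

Under these assumed numbers the cheaper node wins on TFLOPS per dollar despite lower raw throughput, which is the shape of the trade-off the report describes.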
Intel is also sharing Gaudi 2 test results through the MLCommons MLPerf benchmarks for training and inference, further validating the technology's performance with third-party data. "We were impressed by Gaudi 2's efficiency, particularly in LLM inference," said Abhinav Venigalla, Databricks' lead NLP architect, who added that the team did not have time to fully explore the performance benefits of the FP8 support in Gaudi 2's latest software release.
Intel's own data aligns with Databricks' findings. Eitan Medina, COO at Habana Labs (the Intel subsidiary behind Gaudi), said the report corroborates Intel's internal performance measurements and customer feedback. "Validating our claims is essential, especially as many consider Gaudi to be Intel's best-kept secret," he remarked, emphasizing that publications like this raise the technology's visibility.
Since acquiring Habana Labs and its Gaudi technology in 2019 for $2 billion, Intel has steadily enhanced the platform's capabilities. Both Intel and Nvidia participate actively in the MLCommons MLPerf benchmarks, which are refreshed regularly: the latest MLPerf 3.1 training benchmarks, released in November, included new LLM training speed records from both companies, following competitive results in the September inference round.
While benchmarks like MLPerf are insightful, Medina pointed out that many customers prioritize their own testing to ensure compatibility with their specific models and use cases. "The maturity of the software stack is crucial, as clients are sometimes skeptical of benchmarks where vendors heavily optimize for specific metrics," he said. He sees MLPerf results as a valuable initial filter before companies invest further time in hands-on evaluation.
Looking ahead, Intel is gearing up to introduce the Gaudi 3 AI accelerator in 2024. Gaudi 3, built on a 5-nanometer process, promises to deliver four times the processing power and double the network bandwidth of Gaudi 2. "Gaudi 3 represents a significant leap in performance, enhancing performance per dollar and per watt," Medina asserted.
Beyond Gaudi 3, Intel plans to develop future generations that will integrate high-performance computing (HPC) and AI accelerator technologies. The company also recognizes the importance of its CPU technologies for AI inference workloads, recently announcing the 5th Gen Xeon processors with AI acceleration. "CPUs still play a crucial role in inference and fine-tuning tasks, especially when combined with Gaudi accelerators for high-density AI compute workloads," Medina concluded, advocating for a diverse range of solutions.