Meta Unveils GPU Clusters Powering the Training of Llama 3 AI Model

Meta has recently unveiled its advanced AI infrastructure, introducing new GPU clusters designed to elevate the training of next-generation models, including the highly anticipated Llama 3. In a detailed blog post, the company outlined the specifications and capabilities of its two state-of-the-art, data-center-scale clusters, which are engineered to handle larger and more intricate models than its previous hardware could.

Each new cluster is equipped with an impressive 24,576 Nvidia H100 GPUs, a significant upgrade from Meta's earlier cluster, which used approximately 16,000 Nvidia A100 GPUs. According to Omdia research published in 2023, Meta has emerged as one of Nvidia's most substantial clients, acquiring thousands of H100s to bolster its AI initiatives.

These cutting-edge GPUs will play a pivotal role in training both current and future AI systems, particularly Llama 3, the successor to Llama 2. While specific details about Llama 3 remain under wraps, the blog post indicates that training for this next-generation model is actively underway. Additionally, Meta plans to leverage this robust infrastructure for ongoing AI research and development efforts.

One of Meta's primary objectives is to realize the concept of Artificial General Intelligence (AGI), which its chief AI scientist, Yann LeCun, refers to as advanced machine intelligence. The recent blog post emphasizes that the scaling of these GPU clusters is integral to fulfilling the company's AGI ambitions. By the end of 2024, Meta intends to expand its AI infrastructure to include a staggering 350,000 Nvidia H100 GPUs. Coupled with other accelerators across its combined resources, this infrastructure is projected to deliver compute power equivalent to nearly 600,000 H100s.
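To see how 350,000 H100s could translate into "nearly 600,000 H100s" of compute, it helps to express a mixed accelerator fleet in H100-equivalents by ratio of peak throughput. The sketch below is a back-of-envelope illustration, not Meta's methodology: the peak BF16 figures are approximate public datasheet values, and the non-H100 portion of the fleet shown is hypothetical, chosen only to show how the arithmetic could close the gap.

```python
# Illustrative back-of-envelope: express a mixed GPU fleet in
# "H100-equivalents" by scaling each model's count by its peak dense
# BF16 throughput relative to the H100.
# Only the 350,000-H100 figure comes from Meta's announcement; the
# rest of the fleet mix below is hypothetical.

PEAK_BF16_TFLOPS = {
    "H100": 989.0,  # approximate dense BF16 peak (SXM variant)
    "A100": 312.0,  # approximate dense BF16 peak
}

def h100_equivalents(fleet: dict[str, int]) -> float:
    """fleet maps GPU model name -> unit count."""
    h100_peak = PEAK_BF16_TFLOPS["H100"]
    return sum(
        count * PEAK_BF16_TFLOPS[model] / h100_peak
        for model, count in fleet.items()
    )

# Hypothetical mix: the non-H100 hardware would need to contribute
# roughly 250,000 H100-equivalents to reach the ~600,000 figure.
fleet = {"H100": 350_000, "A100": 792_000}  # A100 count is illustrative
print(f"{h100_equivalents(fleet):,.0f} H100-equivalents")
```

With this (invented) mix, the fleet lands just under 600,000 H100-equivalents, matching the scale of the projection in the announcement.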

Diving into the technical specifications, the new clusters were engineered using distinct network fabric solutions: one employs RDMA over Converged Ethernet (RoCE) based on the Arista 7800 switch platform, while the other utilizes Nvidia Quantum-2 InfiniBand fabric. Both solutions provide high-speed 400 Gbps endpoints. This dual approach allows Meta to evaluate the efficiency and scalability of different interconnect types for large-scale training, yielding insights that will inform the design and construction of even larger clusters in the future.
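The announced figures imply an enormous aggregate network capacity per cluster. The sketch below simply multiplies endpoint count by line rate; this is a naive sum of injection bandwidth, not a claim about realized bisection bandwidth, which depends on the fabric topology and oversubscription.

```python
# Naive aggregate injection bandwidth per cluster, from the announced
# figures: 24,576 GPU endpoints, each with a 400 Gbps connection.
GPUS_PER_CLUSTER = 24_576
ENDPOINT_GBPS = 400

total_gbps = GPUS_PER_CLUSTER * ENDPOINT_GBPS
total_pbps = total_gbps / 1_000_000  # Gbps -> Pbps

print(f"{total_gbps:,} Gbps ≈ {total_pbps:.2f} Pbps per cluster")
```

That works out to roughly 9.8 Pbps of endpoint capacity per cluster, which illustrates why fabric choice (RoCE vs. InfiniBand) is a first-order design question at this scale.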

Additionally, the clusters are built on Grand Teton, Meta’s in-house designed, open GPU hardware platform, which enables the creation of purpose-built clusters tailored for specific applications. They also incorporate Meta’s Open Rack architecture, further enhancing their adaptability and efficiency.

This ambitious move signifies Meta's commitment to leading the frontier of artificial intelligence through innovative infrastructure and cutting-edge technology, setting the stage for advancements in AI capabilities and research.
