Meta's recent research report reveals that its cluster of 16,384 NVIDIA H100 GPUs, used to train the 405-billion-parameter LLaMA 3 model, experienced 419 unexpected failures over 54 days, averaging roughly one failure every three hours. More than half of these failures stemmed from the GPUs and their high-bandwidth memory (HBM3).
Because training at this scale is highly synchronized, a single GPU failure can disrupt the entire training process and force a restart. Despite this challenging environment, the Meta team maintained over 90% effective training time. During the 54-day pre-training period, they recorded a total of 466 interruptions: 47 planned and 419 unexpected. Planned interruptions were primarily due to automated maintenance, while unexpected failures were predominantly caused by hardware issues. Notably, GPU-related problems accounted for 58.7% of the unexpected interruptions.
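To see why restart cost matters so much here, consider a minimal, hypothetical sketch of periodic checkpointing (this is not Meta's implementation; the model, objective, save path, and cadence below are placeholders): with regular saves, an unexpected failure only discards the steps completed since the last checkpoint rather than the entire run.

```python
# Minimal checkpoint/resume sketch. Illustrative only: the model, loss, path,
# and save interval are hypothetical stand-ins, not Meta's training stack.
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"      # hypothetical checkpoint location
CKPT_EVERY = 500                 # save every 500 steps (illustrative value)

model = nn.Linear(1024, 1024)    # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Resume from the latest checkpoint if one exists (e.g., after a GPU failure).
start_step = 0
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"] + 1

for step in range(start_step, 10_000):
    x = torch.randn(32, 1024)
    loss = model(x).pow(2).mean()    # dummy objective, just to drive the loop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodic checkpoint: at most CKPT_EVERY steps of work can be lost.
    if step % CKPT_EVERY == 0:
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            CKPT_PATH,
        )
```

The trade-off is that saving more often shrinks the amount of lost work per failure but adds overhead to every interval, which is why the team's focus on reducing checkpoint time (mentioned below) directly protects effective training time.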
Of the 419 unexpected failures, 148 (30.1%) were due to various GPU issues, including NVLink failures, while 72 (17.2%) were caused by faults in the GPU's HBM3 memory. Remarkably, there were only two CPU failures during the entire 54-day period. Additionally, 41.3% of unexpected interruptions were attributed to a combination of software errors, network cable issues, and problems with network adapters.
To enhance efficiency, the Meta team developed numerous tools and optimization strategies, including reducing job startup and checkpointing times, using PyTorch's built-in NCCL flight recorder to diagnose hangs and performance issues, and identifying underperforming GPUs. The team also examined how environmental factors affect GPU performance, such as midday temperature fluctuations and the strain that running thousands of GPUs simultaneously places on the data center's power grid.
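As one illustration of the kind of check such tooling might perform (a hypothetical sketch, not Meta's actual tooling), each rank in a distributed job can time an identical local workload and report the result, so that a GPU running markedly slower than its peers, for example due to thermal throttling, stands out:

```python
# Hypothetical straggler check: every rank benchmarks the same local matmul
# workload, the timings are gathered, and rank 0 flags GPUs that run much
# slower than the median. Illustrative only; not Meta's tooling.
import torch
import torch.distributed as dist


def straggler_check(iters: int = 20, size: int = 4096, slow_factor: float = 1.5):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    a = torch.randn(size, size, device=device, dtype=torch.float16)
    b = torch.randn(size, size, device=device, dtype=torch.float16)

    # Warm up so one-time setup costs are not measured.
    for _ in range(3):
        torch.mm(a, b)
    torch.cuda.synchronize(device)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.mm(a, b)
    end.record()
    torch.cuda.synchronize(device)
    local_ms = start.elapsed_time(end) / iters

    # Gather every rank's average iteration time.
    timings = [torch.zeros(1, device=device) for _ in range(world_size)]
    dist.all_gather(timings, torch.tensor([local_ms], device=device))
    times = torch.cat(timings).cpu()

    if rank == 0:
        median = times.median().item()
        for r, t in enumerate(times.tolist()):
            if t > slow_factor * median:
                print(f"rank {r} looks slow: {t:.2f} ms vs median {median:.2f} ms")


if __name__ == "__main__":
    # Typically launched with torchrun, e.g.:
    #   torchrun --nproc_per_node=8 straggler_check.py
    dist.init_process_group("nccl")
    straggler_check()
    dist.destroy_process_group()
```

Timing purely local work (rather than a collective, which is paced by the slowest participant) is what lets a single throttling GPU show up as an outlier.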
As AI models continue to grow in parameter count, so do the computational resources required to train them. For instance, xAI's planned cluster of 100,000 H100 GPUs would multiply the number of components that can fail, likely driving failure rates significantly higher and presenting even greater challenges for future AI training runs.
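A rough back-of-envelope extrapolation from the reported figures shows the scale of the problem, assuming (naively) that failures scale linearly with GPU count and that per-GPU reliability is unchanged; neither assumption comes from Meta's report:

```python
# Naive extrapolation of failure frequency to a larger cluster.
# Assumes failure count scales linearly with GPU count -- an illustrative
# simplification, not a claim from Meta's report.
hours = 54 * 24                      # length of the 54-day pre-training run
failures_16k = 419                   # unexpected interruptions reported by Meta
mtbf_16k = hours / failures_16k      # ~3.1 hours between failures at 16,384 GPUs

scale = 100_000 / 16_384             # hypothetical 100,000-GPU cluster vs. Meta's
mtbf_100k = mtbf_16k / scale         # ~0.5 hours between failures if nothing else changes

print(f"Mean time between failures at 16,384 GPUs:  {mtbf_16k:.1f} h")
print(f"Naive extrapolation to 100,000 GPUs:        {mtbf_100k:.1f} h")
```

Under these assumptions, a 100,000-GPU run would see an unexpected failure roughly every half hour, which underlines why fault tolerance, fast restarts, and hardware diagnostics will only grow in importance.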