At its re:Invent conference today, Amazon Web Services (AWS) announced SageMaker HyperPod, a new service purpose-built for training and fine-tuning large language models (LLMs). SageMaker HyperPod is generally available starting today.
AWS has consistently invested in SageMaker, its platform for building, training, and deploying machine learning models, positioning it as a cornerstone of its machine learning strategy. In light of the rise of generative AI, leveraging SageMaker to simplify the training and optimization of LLMs is a logical progression.
"SageMaker HyperPod empowers users to create a distributed cluster with accelerated instances tailored for distributed training," explained Ankur Mehrotra, AWS General Manager for SageMaker, in an interview prior to the announcement. "The service facilitates efficient distribution of models and data across your cluster, significantly speeding up the training process."
Mehrotra highlighted that SageMaker HyperPod enables users to frequently save checkpoints, allowing for pauses to analyze and refine the training process without restarting from scratch. Additionally, the service incorporates various fail-safes to ensure that if a GPU malfunctions, the overall training cycle remains intact.
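The checkpoint-and-resume pattern Mehrotra describes is the standard one from training frameworks; here is a minimal PyTorch sketch (the function and file names are illustrative, not HyperPod APIs):

```python
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Persist everything needed to resume training mid-run.
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Restore saved state so training continues from the last
    # checkpoint instead of restarting from scratch.
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]
```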
"For machine learning teams focused solely on model training, this offers a zero-touch experience, creating a self-healing cluster," Mehrotra noted. "These features can expedite the training of foundation models by up to 40%, which is a substantial advantage in terms of cost and time-to-market."
Users can train on Amazon's custom Trainium and Trainium 2 chips or on NVIDIA-based GPU instances, including those powered by the H100 processor.
AWS has also drawn on its experience supporting large-scale LLM training on SageMaker; Falcon 180B, for example, was trained on a SageMaker cluster of thousands of A100 GPUs. Mehrotra acknowledged that those lessons were instrumental in developing HyperPod.
Perplexity AI co-founder and CEO Aravind Srinivas shared that his company received early access to the service during its private beta. He admitted to initial skepticism about training on AWS: "There was a myth—unfounded—that AWS lacked great infrastructure for large model training. Without time for due diligence, we believed it." After the two companies connected, AWS engineers encouraged Perplexity to test the service at no cost. Srinivas found AWS support easy to reach and the GPU capacity ample for Perplexity's needs, especially since the team was already running inference on AWS.
Srinivas emphasized that the AWS HyperPod team was dedicated to optimizing the interconnects that link NVIDIA graphics cards. "They focused on enhancing the primitives from NVIDIA that enable communication of gradients and parameters across various nodes," he said.
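The primitives in question are collective-communication operations such as all-reduce, which NVIDIA exposes through its NCCL library. As a rough sketch of the idea (assuming a PyTorch process group already initialized with the NCCL backend), averaging gradients across nodes looks like this:

```python
import torch.distributed as dist

def average_gradients(model):
    # Assumes dist.init_process_group(backend="nccl") was called
    # earlier, with one process per GPU across the cluster.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # NCCL all-reduce sums each gradient tensor across ranks;
            # dividing by world size yields the mean gradient.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```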
In summary, SageMaker HyperPod represents a significant advance in LLM training on AWS, offering an optimized environment built around speed, efficiency, and ease of use for machine learning teams.