Google Enhances AI Hypercomputer for Enterprise Applications at Cloud Next

In December 2023, Google unveiled its “AI Hypercomputer,” a pioneering supercomputer architecture that integrates performance-optimized hardware, open software, leading machine learning frameworks, and flexible consumption models. The initiative aims to enhance efficiency and productivity across AI training, tuning, and serving for Google Cloud customers, competing with Microsoft and Amazon for enterprise market share.

Google Cloud customers access the AI Hypercomputer remotely through the cloud, using it to train their own AI models and applications. Notably, clients such as Salesforce and Lightricks have already trained large AI models on the AI Hypercomputer's TPU v5p accelerators.

At Google Cloud Next 2024, the company's annual conference in Las Vegas, Google presented significant upgrades to the AI Hypercomputer and highlighted a growing roster of high-profile customers using the platform.

Enhancements to Google Cloud AI Hypercomputer

The first major upgrade is the general availability of Google Cloud's Tensor Processing Unit (TPU) v5p, its most powerful, scalable, and flexible AI accelerator to date. Google is also enhancing its A3 virtual machine (VM) family with new A3 Mega configurations, set to launch in May, powered by NVIDIA H100 Tensor Core GPUs, each of which packs 80 billion transistors.

Furthermore, Google plans to adopt NVIDIA's latest Blackwell GPUs, extending support for high-performance computing (HPC) and AI workloads. This includes virtual machines featuring the NVIDIA HGX B200 and the GB200 NVL72, designed for demanding AI and data analytics tasks. The liquid-cooled GB200 NVL72 systems are built for real-time LLM inference and large-scale training of trillion-parameter models.

Trillion-parameter AI models are still emerging; examples include SambaNova's Samba-1 and Google's Switch Transformer. Meanwhile, chipmakers like NVIDIA and Cerebras are racing to develop hardware for these ever-growing model sizes.
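For context, Switch Transformer reaches trillion-parameter scale through sparse "top-1" expert routing: each token activates only one of many expert networks, so total parameters grow with the expert count while per-token compute stays roughly flat. The sketch below shows the core idea in JAX; it is illustrative only, with made-up layer sizes, and is not Google's implementation.

```python
# A minimal sketch (not Google's code) of top-1 "switch" routing: each token
# is sent to exactly ONE of many expert FFNs, so parameter count grows with
# the number of experts while compute per token stays roughly constant.
import jax
import jax.numpy as jnp

NUM_EXPERTS, D_MODEL, D_FF = 8, 64, 256  # toy sizes for illustration
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)

# Router and per-expert feed-forward weights.
w_router = jax.random.normal(k1, (D_MODEL, NUM_EXPERTS)) * 0.02
w_in = jax.random.normal(k2, (NUM_EXPERTS, D_MODEL, D_FF)) * 0.02
w_out = jax.random.normal(k3, (NUM_EXPERTS, D_FF, D_MODEL)) * 0.02

def switch_layer(tokens):  # tokens: [num_tokens, D_MODEL]
    logits = tokens @ w_router                         # [tokens, experts]
    probs = jax.nn.softmax(logits, axis=-1)
    expert = jnp.argmax(probs, axis=-1)                # top-1 expert per token
    gate = jnp.take_along_axis(probs, expert[:, None], axis=-1)
    # Apply only the chosen expert's FFN to each token, scaled by its gate.
    h = jax.nn.relu(jnp.einsum('td,tdf->tf', tokens, w_in[expert]))
    return gate * jnp.einsum('tf,tfd->td', h, w_out[expert])

tokens = jax.random.normal(k4, (16, D_MODEL))
print(switch_layer(tokens).shape)  # (16, 64)
```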

Notable Google Cloud clients like Character.AI, a chatbot company valued at over $1 billion, are already seeing benefits from the current A3 setup. CEO Noam Shazeer said that Google Cloud's TPUs and A3 VMs let the company train and serve large language models (LLMs) faster and more efficiently, and he noted that the new generation of platforms promises more than 2X better price-performance.

Introducing JetStream for Enhanced AI Performance

On the software side, Google Cloud has launched JetStream, a throughput- and memory-optimized inference engine for large language models. It improves performance per dollar on open models and supports frameworks such as JAX and PyTorch/XLA, raising efficiency while lowering costs.
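JetStream's own API is beyond this article's scope, but the workload it targets looks roughly like the following: a jit-compiled, batched greedy-decode step whose per-token cost the engine works to minimize. This is an illustrative JAX sketch with a toy stand-in for the model, not JetStream code.

```python
# Illustrative only: a jit-compiled, batched greedy-decode step of the kind a
# throughput-oriented engine such as JetStream optimizes. The "forward pass"
# here is a toy stand-in for a real LLM; JetStream's actual API differs.
import jax
import jax.numpy as jnp

VOCAB, D_MODEL, BATCH, MAX_LEN = 32_000, 128, 8, 16
embed = jax.random.normal(jax.random.PRNGKey(0), (VOCAB, D_MODEL)) * 0.02

@jax.jit
def decode_step(buf, pos):
    # Toy forward pass: embed each sequence's latest token, project to vocab.
    logits = embed[buf[jnp.arange(BATCH), pos - 1]] @ embed.T
    next_ids = jnp.argmax(logits, axis=-1)  # greedy pick for every sequence
    # Write into a fixed-size buffer so the step compiles exactly once;
    # batching amortizes compilation and launch overheads across requests.
    return buf.at[jnp.arange(BATCH), pos].set(next_ids)

buf = jnp.zeros((BATCH, MAX_LEN), dtype=jnp.int32)  # 8 one-token prompts
for pos in range(1, 5):
    buf = decode_step(buf, pos)
print(buf.shape)  # (8, 16)
```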

Upgraded Storage Solutions for AI Workloads

Google’s storage solutions are also receiving enhancements. New caching features position data closer to compute instances, speeding up AI training, raising GPU and TPU utilization, and improving energy efficiency. Notably, Hyperdisk ML, a new block storage service, accelerates AI inference and serving workflows, delivering up to 12X faster model load times.

Additional upgrades include Cloud Storage FUSE, which boosts training throughput by 2.9X, and Parallelstore, whose caching accelerates training by up to 3.9X compared with traditional data loaders. Filestore enables simultaneous data access across GPUs and TPUs, improving training times by as much as 56%.
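In practice, Cloud Storage FUSE presents a bucket as an ordinary directory, so a training job can read objects with plain file I/O while the FUSE layer handles caching. A minimal sketch, assuming a bucket has already been mounted at a hypothetical /mnt/gcs path:

```python
# Minimal sketch: once a bucket is mounted with Cloud Storage FUSE
# (e.g. `gcsfuse my-training-bucket /mnt/gcs`; the bucket name and mount
# point are hypothetical), training code reads objects as plain files,
# with the FUSE-layer cache serving repeated epochs locally.
import os

DATA_DIR = "/mnt/gcs/shards"  # hypothetical mount path

def iter_shards():
    for name in sorted(os.listdir(DATA_DIR)):
        with open(os.path.join(DATA_DIR, name), "rb") as f:
            yield f.read()  # ordinary POSIX reads; no GCS client needed

for shard in iter_shards():
    pass  # feed the bytes into the input pipeline
```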

Collaborations and Software Upgrades

Google is also fostering new collaborations and introducing scalable reference implementations of diffusion and language models built on JAX. Open-source contributions in PyTorch/XLA 2.3 improve distributed-training scalability through features such as auto-sharding and asynchronous checkpointing.
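The core idea behind both auto-sharding and the JAX reference implementations is to describe how arrays are split across a device mesh and let the XLA compiler partition the computation to match. Below is a minimal hand-written sketch using JAX's public sharding API; PyTorch/XLA's auto-sharding aims to derive such layouts automatically rather than requiring them by hand.

```python
# A minimal sketch of mesh-based sharding: declare how an array is laid out
# across devices, and XLA partitions the jitted computation to match.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange whatever devices are available (a single CPU also works) in a 1D mesh.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))

x = jnp.arange(32.0).reshape(8, 4)
# Shard the batch dimension across the "data" axis; replicate the feature dim.
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))

@jax.jit
def step(x):
    return (x ** 2).sum()  # XLA partitions this to follow x's sharding

print(step(x))
```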

In partnership with Hugging Face, the Optimum-TPU library lets customers optimize training and serving of AI models on Google’s TPUs. Additionally, Google will offer NVIDIA NIM inference microservices, giving developers flexible options for AI training and deployment.
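Optimum-TPU's exact entry points aside, the promise of the partnership is that ordinary Hugging Face code in the following style runs efficiently on Cloud TPUs. This is a generic Transformers example, not Optimum-TPU's own API, and the checkpoint name is only a placeholder.

```python
# Illustrative Hugging Face usage only; Optimum-TPU's own entry points may
# differ. The model ID below is just an example; any causal LM works.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Google Cloud Next 2024 announced", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```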

To ease capacity planning, Google Cloud is introducing the Dynamic Workload Scheduler, which lets customers reserve GPU capacity in fixed 14-day blocks, optimizing costs for AI workloads.

These updates exemplify the practical business benefits stemming from Google’s research and innovative solutions, creating an integrated, efficient, and scalable environment for AI training and inference.

As for pricing of the AI Hypercomputer offerings, details remain undisclosed. It will be worth watching how the platform competes with Microsoft Azure and AWS for enterprise AI development, and whether Google can sustain its commitment to improving and broadly supporting the AI Hypercomputer.
