Hugging Face has launched an inference-as-a-service offering powered by Nvidia NIM microservices, giving developers markedly better token efficiency with leading AI models.
This new offering promises up to five times better token efficiency with popular AI models, providing immediate access to NIM microservices on the Nvidia DGX Cloud. The announcement was made during Nvidia CEO Jensen Huang’s keynote at the Siggraph conference in Denver, Colorado.
Hugging Face’s thriving community of four million developers now has easy access to Nvidia-accelerated inference for leading AI models. The inference-as-a-service enables rapid deployment and prototyping with powerful large language models such as the Llama 3 family and Mistral AI models, all optimized with Nvidia NIM microservices.
At the Siggraph conference, Hugging Face emphasized the benefits of its new service, targeting developers eager to integrate generative AI into their applications. Kari Briski, Nvidia’s vice president of generative AI software product management, highlighted the challenges developers face when trying to implement generative AI. She stated, “Developers seek straightforward methods to work with APIs, allowing them to prototype and assess model performance regarding accuracy and latency.”
To address these concerns, the new inference-as-a-service joins Train on DGX Cloud, the AI training service already available on Hugging Face. Developers now have access to a hub where they can easily compare and choose from a growing number of open-source models, facilitating experimentation, testing, and deployment of state-of-the-art models on Nvidia-accelerated infrastructure.
Accessing these tools is seamless through the “Train” and “Deploy” menus on Hugging Face model cards, enabling developers to get started within minutes.
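For developers who prefer code over the model-card menus, the service can also be reached programmatically. The snippet below is a minimal sketch using the huggingface_hub Python library’s InferenceClient; the model ID and token handling are illustrative assumptions, and the exact way requests are routed to the NIM-backed DGX Cloud backend may differ from what is shown.

```python
# Minimal sketch: calling a hosted Llama 3 model from Python via huggingface_hub.
# Assumptions: a Hugging Face access token is available in HF_TOKEN, and the model
# ID below is served by the inference-as-a-service; routing to the NIM-backed
# DGX Cloud backend may require different parameters in practice.
import os

from huggingface_hub import InferenceClient

client = InferenceClient(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # illustrative model ID
    token=os.environ["HF_TOKEN"],                  # your Hugging Face access token
)

# Send a chat-style request and print the generated reply.
response = client.chat_completion(
    messages=[{"role": "user", "content": "Summarize what NIM microservices do."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```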
Nvidia NIM Microservices for Inference-as-a-Service
Nvidia NIM consists of a suite of AI microservices optimized for inference, utilizing industry-standard application programming interfaces (APIs). This system enhances processing efficiency for tokens—the basic units of data generated by language models—and improves the performance of the underlying Nvidia DGX Cloud infrastructure, resulting in faster AI applications.
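Because NIM exposes industry-standard, OpenAI-compatible endpoints, an application can often switch to a NIM-backed deployment simply by pointing an existing client at a different base URL. The sketch below assumes a hypothetical endpoint URL, API key, and model name; the real values depend on where and how the microservice is deployed.

```python
# Minimal sketch: talking to a NIM microservice through its OpenAI-compatible
# chat completions API. The base_url and api_key below are placeholders; actual
# values depend on where the NIM microservice runs (e.g., DGX Cloud or a
# self-hosted container).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical NIM endpoint
    api_key="not-needed-for-local-nim",   # placeholder credential
)

completion = client.chat.completions.create(
    model="meta/llama3-70b-instruct",     # illustrative NIM model name
    messages=[{"role": "user", "content": "Explain token throughput in one sentence."}],
    max_tokens=100,
)
print(completion.choices[0].message.content)
```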
For instance, the 70-billion-parameter version of Llama 3 achieves up to five times greater throughput when accessed via NIM compared to standard deployments on Nvidia H100 Tensor Core GPU systems.
The Nvidia DGX Cloud platform is tailored for generative AI, offering reliable, scalable computing resources that expedite the development of production-ready applications. Developers can access GPU resources throughout the AI development lifecycle, from prototyping to production, without the need for long-term infrastructure commitments.
In summary, Hugging Face’s inference-as-a-service on Nvidia DGX Cloud enhances access to optimized computing resources, enabling users to explore the latest AI models within an enterprise-grade environment.
OpenUSD Framework Integration
At Siggraph, Nvidia also introduced generative AI models and NIM microservices for the OpenUSD framework, facilitating the creation of highly accurate virtual worlds and advancing the next evolution of AI in metaverse-like industrial applications.