Nvidia aims to make deploying generative AI large language models (LLMs) significantly easier with a new approach to packaging optimized models for rapid inference.
During today’s Nvidia GTC event, the tech giant introduced Nvidia Inference Microservices (NIM), a software technology that bundles optimized inference engines, industry-standard APIs, and support for AI models into containers for seamless deployment. NIM not only offers prebuilt models but also enables organizations to incorporate their proprietary data and accelerates the deployment of Retrieval Augmented Generation (RAG).
The introduction of NIM represents a pivotal advancement in generative AI deployment, forming the backbone of Nvidia's next-generation inference strategy, one that will touch nearly every model developer and data platform in the industry. Nvidia has collaborated with major software vendors, including SAP, Adobe, Cadence, and CrowdStrike, as well as data platform providers such as Box, Databricks, and Snowflake, to support NIM.
NIM is part of the Nvidia AI Enterprise software suite, version 5.0 of which is being released today at GTC.
“Nvidia NIM is the premier software package and runtime for developers, allowing them to focus on enterprise applications,” stated Manuvir Das, VP of Enterprise Computing at Nvidia.
What is Nvidia NIM?
At its core, NIM is a container filled with microservices. The container can host any kind of model, from open to proprietary, that runs on an Nvidia GPU, whether in the cloud or on a local machine. It can be deployed wherever container technologies are supported, including Kubernetes in the cloud, Linux servers, and serverless Function-as-a-Service models. Nvidia plans to offer the serverless approach on its new ai.nvidia.com website, so developers can start working with NIM before deploying it themselves.
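To illustrate how a deployed NIM is typically consumed, here is a minimal sketch of an application querying a locally running LLM container. It assumes the microservice exposes an OpenAI-compatible chat-completions endpoint on port 8000; the URL and model name are illustrative placeholders rather than details from Nvidia's announcement.

```python
# Minimal sketch: query a locally running NIM LLM container.
# Assumes the container exposes an OpenAI-compatible /v1/chat/completions
# endpoint on port 8000; the URL and model name below are illustrative.
import requests

NIM_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical local deployment

payload = {
    "model": "meta/llama3-8b-instruct",  # placeholder model identifier
    "messages": [
        {"role": "user", "content": "Summarize our Q3 support tickets."}
    ],
    "max_tokens": 256,
}

response = requests.post(NIM_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```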
Importantly, NIM does not replace existing Nvidia model delivery methods. Rather, it packages a model that is highly optimized for Nvidia GPUs together with the technologies needed to accelerate inference.
During the press briefing, Kari Briski, VP of Generative AI Software Product Management at Nvidia, emphasized that Nvidia remains a platform company. Tools that support inference, such as TensorRT and Triton Inference Server, remain vital, she noted.
“Bringing these components together for a production environment to run generative AI at scale requires significant expertise, which is why we’ve packaged them together,” Briski explained.
NIMs to Enhance RAG Capabilities for Enterprises
A key application for NIMs lies in facilitating RAG deployment models.
“Nearly every client we've engaged with has implemented numerous RAGs,” Das noted. “The challenge is transitioning from prototyping to delivering tangible business value in production.”
Nvidia and leading data vendors anticipate that NIMs will provide a viable solution. Vector database capabilities are crucial for enabling RAG, and support for NIMs is being integrated across the vector search ecosystem, including Apache Lucene, Datastax, and Milvus.
The RAG approach will be further enhanced through the integration of NVIDIA NeMo Retriever microservices within NIM deployments. Announced in November 2023, NeMo Retriever is designed to optimize data retrieval for RAG applications.
“When you incorporate a retriever that is both accelerated and trained on high-quality datasets, the impact is significant,” Briski added.
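To make the RAG workflow concrete, below is a minimal sketch that pairs a retrieval step with a NIM-hosted LLM. The endpoint URLs, model names, and sample documents are assumptions for illustration, not details from Nvidia's announcement; a production deployment would use an accelerated retriever such as NeMo Retriever and one of the integrated vector databases rather than the in-memory cosine search shown here.

```python
# Minimal RAG sketch: retrieve a relevant document, then ground the LLM's answer in it.
# Assumptions (not from the article): an embedding microservice exposes an OpenAI-style
# /v1/embeddings endpoint and the LLM exposes /v1/chat/completions; URLs, model names,
# and documents below are illustrative.
import numpy as np
import requests

EMBED_URL = "http://localhost:8001/v1/embeddings"       # hypothetical retriever endpoint
CHAT_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical LLM endpoint

def embed(texts):
    """Return embedding vectors for a list of strings."""
    resp = requests.post(EMBED_URL, json={"model": "nv-embed", "input": texts}, timeout=60)
    resp.raise_for_status()
    return np.array([item["embedding"] for item in resp.json()["data"]])

# Index a handful of proprietary documents (stand-ins for enterprise data).
documents = [
    "Refunds are processed within 5 business days.",
    "Enterprise support is available 24/7 via the customer portal.",
]
doc_vectors = embed(documents)

# Retrieve the document most similar to the user's question via cosine similarity.
question = "How long do refunds take?"
q_vec = embed([question])[0]
scores = doc_vectors @ q_vec / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec))
context = documents[int(scores.argmax())]

# Ask the LLM to answer using only the retrieved context.
payload = {
    "model": "meta/llama3-8b-instruct",  # placeholder model identifier
    "messages": [
        {"role": "user",
         "content": f"Answer using this context:\n{context}\n\nQuestion: {question}"}
    ],
}
resp = requests.post(CHAT_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```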