Acquiring GPU hardware is the easy part. Turning it into a productive, multi-tenant AI service with proper isolation, self-service provisioning, and the governance to operate it at scale is where most teams get stuck. Custom integration work piles up, timelines slip, and the gap between racked hardware and revenue widens.
Rafay is closing that gap through a new integration with the NVIDIA NCX Infrastructure Controller (NICo), NVIDIA's open-source component for automated bare-metal lifecycle management. Together, Rafay and NICo give operators a unified platform to manage their GPU fleets and deliver cloud-like, self-service experiences to end users.
A Smarter Foundation for Multi-Tenancy
Traditional bare-metal multi-tenancy relies on network configuration at the switch layer: creating VPCs, subnets, and tenant isolation through switch-level APIs. This works, but it introduces operational complexity that grows with every new tenant and every new rack.
NICo changes the model. Network configuration moves directly to the host, implemented through the NVIDIA BlueField DPU on each server. The DPU operates in zero trust mode: the host operating system cannot configure the DPU directly. It remains owned and controlled by the service provider, while the host is handed to the tenant. This means that even if a tenant's workload or OS is compromised, the network and management planes stay secure and isolated, a meaningful improvement over conventional bare-metal multi-tenancy.
The result is a network isolation model that is both more secure and dramatically simpler to operate at scale.
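The ownership split described above can be sketched in a few lines. This is a toy illustration only: the class and method names are hypothetical, and real DPU configuration goes through NVIDIA's management plane, not a Python object.

```python
# Toy sketch of the zero-trust ownership split: the DPU is owned by the
# service provider, and configuration attempts from the tenant's host OS
# are rejected. All names here are illustrative assumptions.
class BlueFieldDPU:
    def __init__(self, owner: str):
        self.owner = owner            # the service provider, never the tenant
        self.network_config = {}

    def configure(self, caller: str, key: str, value: str) -> bool:
        """Only the owning provider's control plane may change network config."""
        if caller != self.owner:
            return False              # request from the tenant host OS: denied
        self.network_config[key] = value
        return True

dpu = BlueFieldDPU(owner="provider-control-plane")
dpu.configure("provider-control-plane", "vlan", "tenant-42")   # accepted
dpu.configure("tenant-host-os", "vlan", "tenant-7")            # denied
```

The point of the model is visible in the sketch: even a fully compromised host OS holds no credential that the DPU will accept, so the provider's network and management planes stay intact.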
Where Rafay Comes In
NICo provides the hardware automation layer. Rafay sits on top of NICo's host-based networking model and uses it as the foundation for delivering multi-tenancy at scale — without the operational overhead that traditionally limits how many tenants a team can serve.
Bare-metal provisioning. Rafay leverages NICo's provisioning APIs to automate the full node lifecycle from zero-touch discovery and hardware validation through OS imaging and tenant delivery. What previously required manual intervention or fragmented scripts is now a fully automated, repeatable workflow triggered directly from the Rafay platform.
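The node lifecycle above can be pictured as a strict forward pipeline. The state names and transitions below are illustrative assumptions, not NICo's actual API or state model.

```python
# Hypothetical sketch of the bare-metal node lifecycle: discovery ->
# validation -> imaging -> tenant delivery. Names are assumptions.
from enum import Enum

class NodeState(Enum):
    DISCOVERED = "discovered"     # zero-touch discovery
    VALIDATED = "validated"       # hardware validation passed
    IMAGED = "imaged"             # OS image written
    DELIVERED = "delivered"       # handed to the tenant

# Allowed transitions form a one-way pipeline; there is no path backward.
TRANSITIONS = {
    NodeState.DISCOVERED: NodeState.VALIDATED,
    NodeState.VALIDATED: NodeState.IMAGED,
    NodeState.IMAGED: NodeState.DELIVERED,
}

def advance(state: NodeState) -> NodeState:
    """Move a node one step through the provisioning pipeline."""
    if state not in TRANSITIONS:
        raise ValueError(f"node already {state.value}; no further transition")
    return TRANSITIONS[state]

# Walk a node through the full lifecycle.
state = NodeState.DISCOVERED
while state in TRANSITIONS:
    state = advance(state)
```

Modeling the workflow as explicit states is what makes it repeatable: every node takes the same path, and a failure at any stage is attributable to a specific step.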
Self-service provisioning. Rafay abstracts NICo's APIs into simple workflows. A developer requests a GPU environment; Rafay triggers the NICo workflow to provision and deliver a ready-to-use node, fully isolated at the host network layer — no manual operator steps, no switch changes.
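The request-to-node flow reduces to a single call from the developer's point of view. The sketch below is an assumption-laden illustration: `FakeNicoClient`, `NodeRequest`, and the field names are invented for this example and do not reflect NICo's real API surface.

```python
# Illustrative self-service flow: a developer request triggers a provisioning
# workflow and yields an isolated, ready-to-use node. All names hypothetical.
from dataclasses import dataclass

@dataclass
class NodeRequest:
    tenant: str
    gpu_count: int
    os_image: str

@dataclass
class Node:
    hostname: str
    tenant: str
    network_isolated: bool

class FakeNicoClient:
    """Stand-in for a NICo provisioning client (assumption, not the real API)."""
    def provision(self, req: NodeRequest) -> Node:
        # Tenant isolation is enforced by the DPU at the host network layer,
        # so no switch reconfiguration happens on this path.
        return Node(hostname=f"gpu-{req.tenant}-01",
                    tenant=req.tenant,
                    network_isolated=True)

def self_service_provision(client, req: NodeRequest) -> Node:
    """One call from request to delivered node: no manual operator steps."""
    return client.provision(req)

node = self_service_provision(
    FakeNicoClient(),
    NodeRequest(tenant="acme", gpu_count=8, os_image="ubuntu-22.04-cuda"))
```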
Standardized service SKUs. Operators define SKUs that encode server configuration, OS image, networking, and security controls. Because tenant network isolation is handled by the DPU rather than the switch, those SKUs are faster to deliver and easier to replicate consistently across tenants.
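A SKU of this kind is essentially one immutable record stamped out per tenant. The field names below are assumptions for illustration, not Rafay's actual schema.

```python
# Hypothetical sketch of a service SKU: one record encoding server config,
# OS image, networking, and security controls, replicated per tenant.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ServiceSKU:
    name: str
    gpu_model: str
    gpu_count: int
    os_image: str
    dpu_isolation: bool      # isolation enforced by the DPU, not the switch
    audit_logging: bool

def instantiate_for_tenant(sku: ServiceSKU, tenant: str) -> dict:
    """Stamp out an identical, repeatable configuration for a new tenant."""
    cfg = asdict(sku)
    cfg["tenant"] = tenant
    return cfg

sku = ServiceSKU(name="gpu-large", gpu_model="H100", gpu_count=8,
                 os_image="ubuntu-22.04-cuda", dpu_isolation=True,
                 audit_logging=True)
a = instantiate_for_tenant(sku, "acme")
b = instantiate_for_tenant(sku, "globex")
# Identical except for the tenant field: cross-tenant consistency by construction.
```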
Enterprise governance. Rafay adds RBAC, resource quotas, and audit logging across the entire fleet. Every provisioning event is tracked. Only authorized users can access specific resources — enforced at both the platform and the network layer.
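The governance checks combine naturally into one gate in front of every provisioning event. The structures below are illustrative assumptions, not Rafay's implementation.

```python
# Minimal sketch of governance on the provisioning path: RBAC plus a
# per-tenant GPU quota, with every decision appended to an audit log.
# All data structures here are hypothetical illustrations.
audit_log = []

ROLES = {"alice": "admin", "bob": "developer"}
QUOTA_GPUS = {"acme": 16}     # per-tenant GPU quota
USAGE_GPUS = {"acme": 10}     # GPUs already provisioned for the tenant

def authorize(user: str, tenant: str, gpus: int) -> bool:
    """Allow a provisioning request only if RBAC and quota checks both pass."""
    allowed = (ROLES.get(user) in ("admin", "developer")
               and USAGE_GPUS.get(tenant, 0) + gpus <= QUOTA_GPUS.get(tenant, 0))
    # Every provisioning event is tracked, allowed or not.
    audit_log.append({"user": user, "tenant": tenant,
                      "gpus": gpus, "allowed": allowed})
    return allowed

authorize("bob", "acme", 4)      # within quota: allowed
authorize("bob", "acme", 8)      # would exceed 16 GPUs: denied
```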
Cluster assembly. Once a node is provisioned and network-isolated, Rafay can automatically install the Kubernetes or SLURM stack, GPU drivers, and AI software needed to start work immediately.
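The assembly step is an ordered, idempotent sequence of software layers. The step names below are assumptions sketched for illustration; the real stack is driven by Rafay's automation.

```python
# Illustrative sketch of post-provisioning cluster assembly: once a node is
# isolated and delivered, software layers are applied in a fixed order.
ASSEMBLY_STEPS = [
    "install-gpu-drivers",
    "install-container-runtime",
    "join-kubernetes-or-slurm",
    "deploy-ai-software",
]

def assemble(node: str, installed=None) -> list:
    """Apply each layer in order, skipping anything already present (idempotent)."""
    installed = list(installed or [])
    for step in ASSEMBLY_STEPS:
        if step not in installed:
            installed.append(step)
    return installed

stack = assemble("gpu-acme-01")
```

Idempotence matters here: re-running assembly after a partial failure completes the stack instead of corrupting it.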
Summary
The hardest part of building an AI infrastructure platform is operationalizing hardware at scale. Rafay's integration with the NVIDIA NCX Infrastructure Controller makes this tractable by combining NICo's host-based networking and lifecycle automation with Rafay's orchestration and governance layer to deliver secure, scalable multi-tenancy without the complexity that has historically made bare-metal GPU services difficult to operate.