Accelerating the AI Factory: Rafay & NVIDIA NCX Infra Controller (NICo)

March 17, 2026
Ankur Pandita

Acquiring GPU hardware is the easy part. Turning it into a productive, multi-tenant AI service with proper isolation, self-service provisioning, and the governance to operate it at scale is where most teams get stuck. Custom integration work piles up, timelines slip, and the gap between racked hardware and revenue widens.

Rafay is closing that gap through a new integration with the NVIDIA NCX Infrastructure Controller (NICo), NVIDIA's open-source component for automated bare-metal lifecycle management. Together, Rafay and NICo give operators a unified platform to manage their GPU fleets and deliver cloud-like, self-service experiences to end users.

A Smarter Foundation for Multi-Tenancy

Traditional bare-metal multi-tenancy relies on network configuration at the switch layer: creating VPCs, subnets, and tenant isolation through switch-level APIs. This works, but it introduces operational complexity that grows with every new tenant and every new rack.

NICo changes the model. Network configuration moves directly to the host, implemented through the NVIDIA BlueField DPU on each server. The DPU operates in zero-trust mode: the host operating system cannot configure the DPU directly. It remains owned and controlled by the service provider, while the host is handed to the tenant. This means that even if a tenant's workload or OS is compromised, the network and management planes stay secure and isolated, which is a meaningful improvement over conventional bare-metal multi-tenancy.

The result is a network isolation model that is both more secure and dramatically simpler to operate at scale.

Where Rafay Comes In

NICo provides the hardware automation layer. Rafay sits on top of NICo's host-based networking model and uses it as the foundation for delivering multi-tenancy at scale — without the operational overhead that traditionally limits how many tenants a team can serve.

Bare-metal provisioning. Rafay leverages NICo's provisioning APIs to automate the full node lifecycle from zero-touch discovery and hardware validation through OS imaging and tenant delivery. What previously required manual intervention or fragmented scripts is now a fully automated, repeatable workflow triggered directly from the Rafay platform.
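The lifecycle above can be pictured as a simple state machine that a node walks through from discovery to delivery. This is an illustrative sketch only; the state names and transition logic are assumptions made for explanation, not NICo's actual API or internal model.

```python
from enum import Enum, auto

class NodeState(Enum):
    """Illustrative node lifecycle stages (hypothetical, not NICo's model)."""
    DISCOVERED = auto()   # zero-touch discovery
    VALIDATED = auto()    # hardware validation passed
    IMAGED = auto()       # OS image applied
    DELIVERED = auto()    # handed to the tenant

# Allowed forward transitions through the provisioning workflow.
TRANSITIONS = {
    NodeState.DISCOVERED: NodeState.VALIDATED,
    NodeState.VALIDATED: NodeState.IMAGED,
    NodeState.IMAGED: NodeState.DELIVERED,
}

def advance(state: NodeState) -> NodeState:
    """Move a node one step forward; raise if it is already delivered."""
    if state not in TRANSITIONS:
        raise ValueError(f"{state.name} is a terminal state")
    return TRANSITIONS[state]
```

Modeling the workflow this way makes "fully automated and repeatable" concrete: every node follows the same ordered transitions, with no manual detours.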

Self-service provisioning. Rafay abstracts NICo's APIs into simple workflows. A developer requests a GPU environment; Rafay triggers the NICo workflow to provision and deliver a ready-to-use node, fully isolated at the host network layer — no manual operator steps, no switch changes.
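To illustrate that flow, the sketch below models a developer request passing straight through to a provisioning call with no operator in the loop. `GpuRequest`, `StubNicoClient`, and `handle_request` are hypothetical names invented here; they do not reflect Rafay's or NICo's real interfaces.

```python
from dataclasses import dataclass

@dataclass
class GpuRequest:
    """A developer's self-service ask: which tenant, which environment."""
    tenant: str
    sku: str

class StubNicoClient:
    """Stand-in for a NICo provisioning API (hypothetical interface)."""
    def provision(self, tenant: str, sku: str) -> dict:
        # A real controller would image the node and program the DPU;
        # here we just return a record describing the delivered node.
        return {"tenant": tenant, "sku": sku, "isolated": True, "status": "ready"}

def handle_request(req: GpuRequest, client: StubNicoClient) -> dict:
    # One call yields an isolated, ready node: no manual steps, no switch changes.
    return client.provision(req.tenant, req.sku)
```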

Standardized service SKUs. Operators define SKUs that encode server configuration, OS image, networking, and security controls. Because tenant network isolation is handled by the DPU rather than the switch, those SKUs are faster to deliver and easier to replicate consistently across tenants.
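A SKU of this kind can be thought of as a small, immutable record that is stamped out identically for each tenant. The fields and names below are illustrative assumptions, not an actual Rafay schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceSku:
    """Illustrative service SKU: one definition, replicated per tenant."""
    name: str
    server_config: str      # e.g. GPU count and model
    os_image: str
    network_profile: str    # applied by the DPU, not the switch
    security_controls: tuple = ()

def instantiate(sku: ServiceSku, tenant: str) -> dict:
    """Produce a tenant delivery from the shared SKU definition."""
    return {
        "tenant": tenant,
        "sku": sku.name,
        "image": sku.os_image,
        "network": sku.network_profile,
    }
```

Because the network profile lives in the SKU and is enforced at the host (DPU) rather than the switch, two tenants instantiated from the same SKU get byte-for-byte identical configurations.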

Enterprise governance. Rafay adds RBAC, resource quotas, and audit logging across the entire fleet. Every provisioning event is tracked. Only authorized users can access specific resources — enforced at both the platform and the network layer.
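Role-based access control paired with audit logging can be sketched in a few lines. The roles, permission sets, and log format here are hypothetical examples, not Rafay's actual RBAC model.

```python
# Illustrative role-to-permission mapping (hypothetical roles and actions).
ROLE_PERMISSIONS = {
    "operator": {"provision", "deprovision", "view"},
    "developer": {"provision", "view"},
    "auditor": {"view"},
}

audit_log: list[dict] = []

def authorize(user: str, role: str, action: str) -> bool:
    """Check a permission and record the attempt, allowed or denied."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    # Every event is tracked, including denials.
    audit_log.append({"user": user, "action": action, "allowed": allowed})
    return allowed
```

The key property is that the log entry is written before the answer is returned, so the audit trail covers denied attempts as well as successful ones.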

Cluster assembly. Once a node is provisioned and network-isolated, Rafay can automatically install the Kubernetes or SLURM stack, GPU drivers, and AI software needed to start work immediately.
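Conceptually, cluster assembly is an ordered bootstrap sequence run on each delivered node. The step names below are illustrative placeholders; the actual stack installed depends on whether the tenant chose Kubernetes or SLURM.

```python
# Hypothetical post-provisioning bootstrap sequence, run per node.
BOOTSTRAP_STEPS = [
    "install_gpu_driver",
    "install_container_runtime",
    "join_cluster",              # Kubernetes join or SLURM compute-node config
    "deploy_ai_software_stack",
]

def assemble(node: str) -> list[str]:
    """Return the ordered actions to run on a freshly isolated node."""
    return [f"{node}:{step}" for step in BOOTSTRAP_STEPS]
```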

Rafay as Part of the NVIDIA AI Cloud-Ready ISV Validation Initiative

Building on this foundation, Rafay is a validated ISV under the NVIDIA AI Cloud-Ready ISV Validation Initiative, where solutions are assessed against the NCP Software Reference Guide across networking, compute, orchestration, and AI platform layers.

This initiative brings together infrastructure and platform software providers aligned to NVIDIA’s reference architecture, ensuring solutions are validated for production-scale AI factories, real workloads, and consistent deployment at scale.

Within this ecosystem, Rafay serves as the orchestration and consumption layer that turns validated infrastructure into secure, multi-tenant, self-service AI services.

Learn more about the NVIDIA AI Cloud-Ready ISV Validation Initiative

Summary

The hardest part of building an AI infrastructure platform is operationalizing hardware at scale. Rafay's integration with the NVIDIA NCX Infrastructure Controller makes this tractable by combining NICo's host-based networking and lifecycle automation with Rafay's orchestration and governance layer, delivering secure, scalable multi-tenancy without the complexity that has historically made bare-metal GPU services difficult to operate.

For more information, see the NVIDIA NCX Infrastructure Controller documentation.
