Accelerating the AI Factory: Rafay & NVIDIA NCX Infra Controller (NICo)

March 17, 2026
Ankur Pandita

Acquiring GPU hardware is the easy part. Turning it into a productive, multi-tenant AI service with proper isolation, self-service provisioning, and the governance to operate it at scale is where most teams get stuck. Custom integration work piles up, timelines slip, and the gap between racked hardware and revenue widens.

Rafay is closing that gap through a new integration with the NVIDIA NCX Infrastructure Controller (NICo), NVIDIA's open-source component for automated bare-metal lifecycle management. Together, Rafay and NICo give operators a unified platform to manage their GPU fleets and deliver cloud-like, self-service experiences to end users.

A Smarter Foundation for Multi-Tenancy

Traditional bare-metal multi-tenancy relies on network configuration at the switch layer: creating VPCs, subnets, and tenant isolation through switch-level APIs. This works, but it introduces operational complexity that grows with every new tenant and every new rack.

NICo changes the model. Network configuration moves directly to the host, implemented through the NVIDIA BlueField DPU on each server. The DPU operates in zero-trust mode: the host operating system cannot configure the DPU directly. It remains owned and controlled by the service provider, while the host is handed to the tenant. This means that even if a tenant's workload or OS is compromised, the network and management planes stay secure and isolated, which is a meaningful improvement over conventional bare-metal multi-tenancy.

The result is a network isolation model that is both more secure and dramatically simpler to operate at scale.

Where Rafay Comes In

NICo provides the hardware automation layer. Rafay sits on top of NICo's host-based networking model and uses it as the foundation for delivering multi-tenancy at scale — without the operational overhead that traditionally limits how many tenants a team can serve.

Bare-metal provisioning. Rafay leverages NICo's provisioning APIs to automate the full node lifecycle from zero-touch discovery and hardware validation through OS imaging and tenant delivery. What previously required manual intervention or fragmented scripts is now a fully automated, repeatable workflow triggered directly from the Rafay platform.
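The lifecycle above can be pictured as a simple state machine that a node walks through from discovery to delivery. This is an illustrative sketch only; the state names and transition logic are assumptions made for explanation, not NICo's actual API or internal model.

```python
from enum import Enum, auto

class NodeState(Enum):
    """Illustrative node lifecycle stages (hypothetical, not NICo's model)."""
    DISCOVERED = auto()   # zero-touch discovery
    VALIDATED = auto()    # hardware validation passed
    IMAGED = auto()       # OS image applied
    DELIVERED = auto()    # handed to the tenant

# Allowed forward transitions through the provisioning workflow.
TRANSITIONS = {
    NodeState.DISCOVERED: NodeState.VALIDATED,
    NodeState.VALIDATED: NodeState.IMAGED,
    NodeState.IMAGED: NodeState.DELIVERED,
}

def advance(state: NodeState) -> NodeState:
    """Move a node one step forward; raise if it is already delivered."""
    if state not in TRANSITIONS:
        raise ValueError(f"{state.name} is a terminal state")
    return TRANSITIONS[state]
```

Modeling the workflow this way makes "fully automated and repeatable" concrete: every node follows the same ordered transitions, with no manual detours.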

Self-service provisioning. Rafay abstracts NICo's APIs into simple workflows. A developer requests a GPU environment; Rafay triggers the NICo workflow to provision and deliver a ready-to-use node, fully isolated at the host network layer — no manual operator steps, no switch changes.
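To illustrate that flow, the sketch below models a developer request passing straight through to a provisioning call with no operator in the loop. `GpuRequest`, `StubNicoClient`, and `handle_request` are hypothetical names invented here; they do not reflect Rafay's or NICo's real interfaces.

```python
from dataclasses import dataclass

@dataclass
class GpuRequest:
    """A developer's self-service ask: which tenant, which environment."""
    tenant: str
    sku: str

class StubNicoClient:
    """Stand-in for a NICo provisioning API (hypothetical interface)."""
    def provision(self, tenant: str, sku: str) -> dict:
        # A real controller would image the node and program the DPU;
        # here we just return a record describing the delivered node.
        return {"tenant": tenant, "sku": sku, "isolated": True, "status": "ready"}

def handle_request(req: GpuRequest, client: StubNicoClient) -> dict:
    # One call yields an isolated, ready node: no manual steps, no switch changes.
    return client.provision(req.tenant, req.sku)
```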

Standardized service SKUs. Operators define SKUs that encode server configuration, OS image, networking, and security controls. Because tenant network isolation is handled by the DPU rather than the switch, those SKUs are faster to deliver and easier to replicate consistently across tenants.
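A SKU of this kind can be thought of as a small, immutable record that is stamped out identically for each tenant. The fields and names below are illustrative assumptions, not an actual Rafay schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceSku:
    """Illustrative service SKU: one definition, replicated per tenant."""
    name: str
    server_config: str      # e.g. GPU count and model
    os_image: str
    network_profile: str    # applied by the DPU, not the switch
    security_controls: tuple = ()

def instantiate(sku: ServiceSku, tenant: str) -> dict:
    """Produce a tenant delivery from the shared SKU definition."""
    return {
        "tenant": tenant,
        "sku": sku.name,
        "image": sku.os_image,
        "network": sku.network_profile,
    }
```

Because the network profile lives in the SKU and is enforced at the host (DPU) rather than the switch, two tenants instantiated from the same SKU get byte-for-byte identical configurations.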

Enterprise governance. Rafay adds RBAC, resource quotas, and audit logging across the entire fleet. Every provisioning event is tracked. Only authorized users can access specific resources — enforced at both the platform and the network layer.
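Role-based access control paired with audit logging can be sketched in a few lines. The roles, permission sets, and log format here are hypothetical examples, not Rafay's actual RBAC model.

```python
# Illustrative role-to-permission mapping (hypothetical roles and actions).
ROLE_PERMISSIONS = {
    "operator": {"provision", "deprovision", "view"},
    "developer": {"provision", "view"},
    "auditor": {"view"},
}

audit_log: list[dict] = []

def authorize(user: str, role: str, action: str) -> bool:
    """Check a permission and record the attempt, allowed or denied."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    # Every event is tracked, including denials.
    audit_log.append({"user": user, "action": action, "allowed": allowed})
    return allowed
```

The key property is that the log entry is written before the answer is returned, so the audit trail covers denied attempts as well as successful ones.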

Cluster assembly. Once a node is provisioned and network-isolated, Rafay can automatically install the Kubernetes or SLURM stack, GPU drivers, and AI software needed to start work immediately.
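Conceptually, cluster assembly is an ordered bootstrap sequence run on each delivered node. The step names below are illustrative placeholders; the actual stack installed depends on whether the tenant chose Kubernetes or SLURM.

```python
# Hypothetical post-provisioning bootstrap sequence, run per node.
BOOTSTRAP_STEPS = [
    "install_gpu_driver",
    "install_container_runtime",
    "join_cluster",              # Kubernetes join or SLURM compute-node config
    "deploy_ai_software_stack",
]

def assemble(node: str) -> list[str]:
    """Return the ordered actions to run on a freshly isolated node."""
    return [f"{node}:{step}" for step in BOOTSTRAP_STEPS]
```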

Rafay as Part of the NVIDIA AI Cloud-Ready ISV Validation Initiative

Building on this foundation, Rafay is a validated ISV under the NVIDIA AI Cloud-Ready ISV Validation Initiative, where solutions are assessed against the NCP Software Reference Guide across networking, compute, orchestration, and AI platform layers.

This initiative brings together infrastructure and platform software providers aligned to NVIDIA’s reference architecture, ensuring solutions are validated for production-scale AI factories, real workloads, and consistent deployment at scale.

Within this ecosystem, Rafay serves as the orchestration and consumption layer that turns validated infrastructure into secure, multi-tenant, self-service AI services.

Learn more about the NVIDIA AI Cloud-Ready ISV Validation Initiative

Summary

The hardest part of building an AI infrastructure platform is operationalizing hardware at scale. Rafay's integration with the NVIDIA NCX Infrastructure Controller makes this tractable by combining NICo's host-based networking and lifecycle automation with Rafay's orchestration and governance layer, delivering secure, scalable multi-tenancy without the complexity that has historically made bare-metal GPU services difficult to operate.

For more information, see the NVIDIA NCX Infrastructure Controller documentation.
