Rafay and NVIDIA DSX OS: Turning Open-Source Components into a Consumable AI Cloud

May 30, 2026

AI factory operators are past the GPU capacity question. The harder one now is how to turn that capacity into production AI services that downstream tenants actually consume, with the lifecycle automation, runtime consistency, multi-tenant operations, and platform services required to deliver them reliably at scale. NVIDIA DSX OS provides the open-source software building blocks for this transition. The Rafay Platform integrates those building blocks into a system operators ship to customers as a consumable AI cloud.

What DSX OS Brings to the AI Factory

NVIDIA DSX OS is open-source, modular software for building, operating, and scaling AI factory infrastructure. As part of the NVIDIA DSX platform, it delivers composable components for lifecycle management, runtime consistency, health automation, resiliency, multi-tenant operations, and AI platform services, designed to integrate into partner control planes, infrastructure platforms, and AI cloud service stacks.

DSX OS was built from NVIDIA DGX Cloud operating experience and is now released as open source for the partner ecosystem. The Rafay Platform integrates DSX OS across two component families: provisioning and multi-tenant operations, and intelligent scheduling and platform services. Together, these integrations close the loop from racked hardware to the consumption layer operators sell.

Provisioning and Multi-Tenant Operations: NICo and AICR

At AI factory scale, provisioning is not a one-time setup. It is a continuous workflow. Nodes cycle through tenant assignments, hardware gets replaced, software stacks evolve, and every transition has to be auditable, secure, and reversible. Manual provisioning and human-managed isolation do not survive the operational tempo of a multi-tenant GPU cloud.
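To make "auditable and reversible" concrete, here is a minimal sketch of a node lifecycle modeled as a small state machine with an append-only audit log. The state names, the `Node` class, and the transition table are hypothetical illustrations, not the NICo API or its data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical lifecycle states and allowed moves; not taken from NICo.
ALLOWED_TRANSITIONS = {
    "available": {"assigned", "maintenance"},
    "assigned": {"sanitizing"},
    "sanitizing": {"available", "maintenance"},
    "maintenance": {"available"},
}


@dataclass
class Node:
    name: str
    state: str = "available"
    tenant: Optional[str] = None
    audit_log: list = field(default_factory=list)

    def _record(self, action, actor, to_state, to_tenant):
        # Append the audit entry first, then apply the change.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "actor": actor,
            "from": (self.state, self.tenant),
            "to": (to_state, to_tenant),
        })
        self.state, self.tenant = to_state, to_tenant

    def transition(self, new_state, actor, tenant=None):
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"{self.name}: {self.state} -> {new_state} is not allowed")
        if new_state == "assigned" and tenant is None:
            raise ValueError("assigning a node requires a tenant")
        # Only an assigned node carries a tenant.
        self._record("transition", actor, new_state,
                     tenant if new_state == "assigned" else None)

    def rollback(self, actor):
        """Undo the most recent transition; the rollback itself is logged."""
        if not self.audit_log or self.audit_log[-1]["action"] != "transition":
            raise ValueError("nothing to roll back")
        prev_state, prev_tenant = self.audit_log[-1]["from"]
        self._record("rollback", actor, prev_state, prev_tenant)


if __name__ == "__main__":
    node = Node("gpu-node-01")
    node.transition("assigned", actor="ops", tenant="tenant-a")
    node.rollback(actor="ops")
    for entry in node.audit_log:
        print(entry["action"], entry["from"], "->", entry["to"])
```

In this sketch a node cannot jump from one tenant straight back to the free pool: it has to pass through a sanitizing step, and every move, including a rollback, leaves a log entry. A production system would persist the log and enforce the isolation in hardware rather than in application code.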

NVIDIA Infra Controller (NICo) makes this layer programmable, with API-driven bare-metal lifecycle management and hardware-enforced tenant isolation through NVIDIA BlueField DPUs and NVIDIA DOCA Platform Framework (DPF). NVIDIA AI Cluster Runtime (AICR) complements NICo by capturing validated runtime configurations as version-locked recipes, eliminating the configuration drift that causes silent failures across large fleets.
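The idea of a version-locked recipe can be illustrated with a short sketch: pin each component to an exact version, fingerprint the pinned set, and compare what a node reports against it. The component names, version strings, and both function names below are placeholders chosen for illustration; they do not reflect AICR's actual recipe format or tooling.

```python
import hashlib
import json


def recipe_fingerprint(recipe):
    """Stable hash of a recipe, independent of key order."""
    canonical = json.dumps(recipe, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(recipe, observed):
    """Return {component: (pinned, actual)} for every mismatch.

    A component missing from the node is reported with actual=None.
    """
    drift = {}
    for component, pinned in recipe.items():
        actual = observed.get(component)
        if actual != pinned:
            drift[component] = (pinned, actual)
    return drift


if __name__ == "__main__":
    # Placeholder component names and versions, for illustration only.
    recipe = {"gpu_driver": "1.0.0", "container_toolkit": "2.3.1", "network_operator": "0.9.2"}
    observed = {"gpu_driver": "1.0.0", "container_toolkit": "2.4.0"}

    print("recipe fingerprint:", recipe_fingerprint(recipe)[:12])
    for component, (pinned, actual) in find_drift(recipe, observed).items():
        print(f"{component}: pinned {pinned}, found {actual}")
```

Comparing against exact pins rather than version ranges is the design choice that matters here: a node that is "close enough" still counts as drifted, which is what keeps small differences from accumulating silently across a fleet.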

Rafay is integrating NICo into its platform to automate bare-metal lifecycle management, secure tenant transitions, and multi-tenant AI factory operations across large-scale GPU environments. Operators using the Rafay Platform stand up the full data processing unit (DPU) hardware and software stack in a single step, with reusable templates, GitOps-driven consistency from staging to production, and automated lifecycle and policy governance applied across every DPU node. The result is secure multi-tenancy and zero-trust isolation enforced in hardware, deployed at scale.

Validated Under the NVIDIA AI Cloud-Ready Initiative

The Rafay Platform is also among the first ISV solutions validated under the NVIDIA AI Cloud-Ready ISV Validation Initiative, tested against NVIDIA's published software architecture reference and covering the workload orchestration and AI platform layers of NVIDIA's ISV-NCP Validation Suite. This validation aligns with AICR's intent: the runtime stack operators put into production is one NVIDIA has tested end-to-end. The combined effect for operators is provisioning that is programmable from day one and a runtime stack that does not drift. 

Scheduling and Platform Services: KAI Scheduler, NVIDIA Run:ai, Dynamo, Grove, and NVCF

GPU access is necessary but not sufficient for AI services. Workloads need topology-aware scheduling, distributed inference serving, and production-grade APIs that downstream tenants can build against. Operators trying to assemble this layer themselves end up stitching together open-source projects, building custom APIs, and managing multi-tenancy at the application boundary. That work is expensive, fragile, and a poor fit for a service operators want to monetize.

DSX OS gives operators a coherent set of components for this layer. KAI Scheduler and NVIDIA Run:ai provide GPU-aware workload placement with fractional allocation and hierarchical quotas. NVIDIA Dynamo and NVIDIA Grove deliver distributed inference serving with disaggregated prefill and decode and per-stage autoscaling. NVIDIA Cloud Functions (NVCF) ties the stack together with unified APIs across inference, fine-tuning, and batch workloads, with multi-tenancy built in.
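As a rough illustration of how fractional allocation and hierarchical quotas interact, the sketch below admits a request only if it fits under the requesting team's quota and under every ancestor's quota. The `QuotaNode` class and the numbers are hypothetical; this is not how KAI Scheduler or Run:ai implement quotas, only the general admission rule.

```python
from dataclasses import dataclass
from typing import Optional

EPSILON = 1e-9  # tolerance for floating-point GPU fractions


@dataclass
class QuotaNode:
    name: str
    limit: float                      # GPUs; fractions such as 0.5 are allowed
    parent: Optional["QuotaNode"] = None
    used: float = 0.0

    def can_allocate(self, gpus):
        # Walk up the tree: every level must have room for the request.
        node = self
        while node is not None:
            if node.used + gpus > node.limit + EPSILON:
                return False
            node = node.parent
        return True

    def allocate(self, gpus):
        """Charge the request to this node and all ancestors, or refuse it."""
        if gpus <= 0:
            raise ValueError("request must be positive")
        if not self.can_allocate(gpus):
            return False
        node = self
        while node is not None:
            node.used += gpus
            node = node.parent
        return True


if __name__ == "__main__":
    org = QuotaNode("org", limit=4.0)
    team_a = QuotaNode("team-a", limit=3.0, parent=org)
    team_b = QuotaNode("team-b", limit=3.0, parent=org)

    print(team_a.allocate(2.5))  # fits under team-a and org
    print(team_b.allocate(2.0))  # fits team-b, but org would exceed 4.0
    print(team_b.allocate(1.5))  # fits both
```

The team limits here deliberately sum to more than the organization limit, so a request can be refused by a parent even when the team itself has headroom. Real schedulers add further policy on top, such as preemption and borrowing of idle quota, which this sketch leaves out.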

Rafay is integrating NVIDIA Cloud Functions as part of its platform to deliver self-hosted inference services with a unified API, multi-tenant operations, and scalable deployment across GPU infrastructure. 

The other components sit alongside this integration: KAI Scheduler and Run:ai handle the GPU scheduling layer, Dynamo and Grove deliver the serving runtime, and the Rafay Platform exposes the full surface as consumable services that operators offer to downstream enterprise tenants. The outcome is straightforward: operators stand up inference, fine-tuning, and batch services on validated NVIDIA components, with the consumption layer, governance, and tenant management handled by Rafay.
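The per-stage autoscaling mentioned for disaggregated serving can be sketched as two independent sizing decisions, one for prefill and one for decode, each driven by its own load signal. The signals chosen here (prompt tokens per second for prefill, concurrent sequences for decode), the capacity figures, and the function names are assumptions for illustration and do not describe Dynamo's or Grove's actual scaling logic.

```python
import math


def desired_replicas(load, capacity_per_replica, min_replicas=1, max_replicas=64):
    """Smallest replica count that covers the load, clamped to the bounds."""
    if capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    needed = math.ceil(load / capacity_per_replica)
    return max(min_replicas, min(max_replicas, needed))


def scale_stages(prefill_tokens_per_s, decode_active_sequences,
                 prefill_capacity, decode_capacity):
    """Size the prefill and decode pools separately from their own signals."""
    return {
        "prefill": desired_replicas(prefill_tokens_per_s, prefill_capacity),
        "decode": desired_replicas(decode_active_sequences, decode_capacity),
    }


if __name__ == "__main__":
    # Assumed load and per-replica capacity figures, for illustration only.
    plan = scale_stages(
        prefill_tokens_per_s=90_000,
        decode_active_sequences=130,
        prefill_capacity=20_000,
        decode_capacity=64,
    )
    print(plan)
```

Sizing the two pools separately is the point of disaggregation: a burst of long prompts grows the prefill pool without forcing the decode pool to grow with it, and the reverse holds for many long-running generations.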

Why Both Integrations Matter Together

Provisioning and platform services are the same workflow at two altitudes. NICo (with NVIDIA BlueField and the NVIDIA DOCA Platform Framework) and AICR put racked hardware into production with auditable tenancy and a consistent runtime. NVCF, Dynamo, Grove, KAI Scheduler, and Run:ai turn that production capacity into AI services with the APIs, scheduling, and serving substrate tenants need. NVIDIA DSX OS delivers the open-source building blocks. The Rafay Platform ships them as a consumable AI cloud.

For operators, the integration shortens the path from infrastructure investment to customer-facing service, improves reliability through validated NVIDIA components, accelerates deployment with productized workflows, and delivers production AI services at scale on a single platform rather than through a stack of bespoke integrations.

Where This Leads

NVIDIA DSX OS delivers the open-source software stack for the AI factory. The Rafay Platform delivers the orchestration, multi-tenancy, and consumption layer that makes it consumable. AI factory operators bring the infrastructure, the customer relationships, and the regulatory footprint that turn it into a business. Rafay's integration across both DSX OS component families gives operators a single platform path to adopt NVIDIA's open-source agentic AI infrastructure software without taking on the engineering burden of stitching it together themselves. The shared goal is straightforward: lower the cost of intelligence and accelerate the delivery of production AI services across the ecosystem.
