Advancing GPU Scheduling and Isolation in Kubernetes

March 25, 2026

At KubeCon Europe 2026, NVIDIA made a set of significant open-source contributions that advance how GPUs are managed in Kubernetes. They span three areas: resource allocation (DRA), scheduling (KAI), and isolation (Kata Containers). First, NVIDIA donated its DRA Driver for GPUs to the Cloud Native Computing Foundation, transferring governance from a single vendor to full community ownership under the Kubernetes project. Second, the KAI Scheduler was formally accepted as a CNCF Sandbox project, marking its transition from an NVIDIA-governed tool to a community-developed project. Third, NVIDIA collaborated with the CNCF Confidential Containers community to introduce GPU support for Kata Containers, extending hardware-level isolation to GPU-accelerated workloads. Together, these contributions move GPU infrastructure closer to a first-class, community-owned, scheduler-integrated model.

1. Dynamic Resource Allocation (DRA): Toward Scheduler-Aware GPUs

Kubernetes' Device Plugin framework has been the standard mechanism for exposing GPUs since v1.8. While widely adopted, it has limitations:

  • Scheduling is largely based on integer resource counts, with limited awareness of device attributes
  • Topology information (e.g., NUMA, NVLink) is not fully integrated into scheduling decisions
  • Sharing and allocation semantics are often implemented out of tree or in vendor-specific ways

The Dynamic Resource Allocation (DRA) API introduces a more expressive model through:

  • DeviceClass – defines a category of devices, including selection criteria and configuration
  • ResourceSlice – published by drivers to advertise the devices available for allocation
  • ResourceClaim / ResourceClaimTemplate – declarative workload requests for devices

NVIDIA's DRA driver extends this model for GPUs, enabling:

  • Attribute-based scheduling aligned with device capabilities
  • Integration with MIG and time-slicing mechanisms
  • Better coordination between the scheduler and allocation lifecycle

This shifts GPU allocation toward a scheduler-visible, declarative workflow, enabling more precise placement and improved utilization.
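
To make the contrast with integer counts concrete, the following Python sketch models a tiny device inventory and an attribute-aware claim. It is a toy model and not the DRA API: the `Device` and `Claim` classes, the attribute names (`memoryGiB`, `migCapable`), and the `allocate` function are all hypothetical, and real claims are expressed through the `resource.k8s.io` API rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy stand-in for one device advertised by a driver (hypothetical)."""
    name: str
    node: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Claim:
    """Toy stand-in for a workload request (hypothetical)."""
    count: int
    required: dict  # attribute name -> required value


def matches(device, claim):
    # A device qualifies only if every required attribute has the requested value.
    return all(device.attributes.get(k) == v for k, v in claim.required.items())


def allocate(devices, claim):
    """Return `claim.count` matching devices from a single node, or None."""
    by_node = {}
    for device in devices:
        if matches(device, claim):
            by_node.setdefault(device.node, []).append(device)
    for candidates in by_node.values():
        if len(candidates) >= claim.count:
            return candidates[: claim.count]
    return None


inventory = [
    Device("gpu-0", "node-a", {"memoryGiB": 40, "migCapable": True}),
    Device("gpu-1", "node-a", {"memoryGiB": 40, "migCapable": True}),
    Device("gpu-0", "node-b", {"memoryGiB": 80, "migCapable": True}),
    Device("gpu-1", "node-b", {"memoryGiB": 80, "migCapable": True}),
]

# A count-only request ("2 GPUs") could land on either node;
# the attribute-aware claim narrows placement to the 80 GiB devices.
picked = allocate(inventory, Claim(count=2, required={"memoryGiB": 80}))
```

With a count-only model, both nodes look identical to the scheduler. Once attributes are visible, the same request for two devices resolves to `node-b` only.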

2. KAI Scheduler: AI-Aware Scheduling Semantics

Distributed AI workloads introduce requirements that are not fully addressed by the default Kubernetes scheduler, including:

  • Gang scheduling for coordinated multi-pod execution
  • Fair sharing of GPUs across teams
  • Avoiding partial allocations that degrade job efficiency

The KAI Scheduler addresses these requirements with:

  • Gang scheduling semantics for all-or-nothing placement
  • Hierarchical queues with Dominant Resource Fairness (DRF)
  • Support for sub-GPU allocation strategies, depending on device capabilities
  • Pre-scheduling simulation to reduce preemption overhead
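
The all-or-nothing idea behind gang scheduling can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not KAI Scheduler's implementation: the `gang_schedule` function and the node names are hypothetical, GPUs are the only resource considered, and placement uses a greedy first-fit heuristic that can miss a feasible packing.

```python
def gang_schedule(free_gpus, pod_requests):
    """Place every pod in the gang, or none of them.

    free_gpus:    dict mapping node name -> free GPU count
    pod_requests: list of GPU counts, one entry per pod in the gang
    Returns a list of node names (one per pod, same order), or None.
    """
    # Work on a copy so a failed attempt leaves the real state untouched.
    remaining = dict(free_gpus)
    placement = [None] * len(pod_requests)

    # Place the largest requests first (first-fit decreasing).
    order = sorted(range(len(pod_requests)), key=lambda i: -pod_requests[i])
    for i in order:
        need = pod_requests[i]
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # one pod cannot be placed, so reject the whole gang
        remaining[node] -= need
        placement[i] = node
    return placement


cluster = {"node-a": 4, "node-b": 2}

fits = gang_schedule(cluster, [2, 2, 2])         # all three pods can be placed
too_big = gang_schedule(cluster, [2, 2, 2, 2])   # the fourth pod cannot, so nothing is placed
```

The key property is that a job needing four pods never ends up holding GPUs for three of them while waiting indefinitely for the fourth.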

This reflects a broader trend toward domain-specific schedulers that extend Kubernetes for AI/ML workloads.
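
Dominant Resource Fairness, mentioned in the list above, can be illustrated with a small simulation. In DRF, each queue's dominant share is its largest fractional use of any single resource, and the next allocation goes to the queue with the smallest dominant share. The sketch below is a generic illustration, not KAI Scheduler's code: the queue names, capacities, and per-task demands are made-up values, and hierarchy is omitted.

```python
def dominant_share(allocated, capacity):
    # The largest fraction of any one resource this queue currently holds.
    return max(allocated[r] / capacity[r] for r in capacity)


def drf_allocate(capacity, demands):
    """Grant tasks one at a time to the queue with the lowest dominant share.

    capacity: dict resource -> total amount in the cluster
    demands:  dict queue -> dict resource -> amount needed per task
    Returns a dict queue -> number of tasks granted.
    """
    used = {r: 0 for r in capacity}
    allocated = {q: {r: 0 for r in capacity} for q in demands}
    tasks = {q: 0 for q in demands}
    active = set(demands)

    while active:
        # sorted() makes tie-breaking deterministic (alphabetical by queue name).
        queue = min(sorted(active), key=lambda q: dominant_share(allocated[q], capacity))
        demand = demands[queue]
        if any(used[r] + demand[r] > capacity[r] for r in capacity):
            active.discard(queue)  # this queue's next task no longer fits
            continue
        for r in capacity:
            used[r] += demand[r]
            allocated[queue][r] += demand[r]
        tasks[queue] += 1
    return tasks


capacity = {"gpu": 16, "cpu": 64}
demands = {
    "research": {"gpu": 2, "cpu": 4},  # GPU-dominant: each task is 1/8 of all GPUs
    "serving": {"gpu": 1, "cpu": 8},   # CPU-dominant: each task is 1/8 of all CPUs
}
granted = drf_allocate(capacity, demands)
```

In this example both queues end up with five tasks and a dominant share of 5/8, even though they consume very different mixes of GPU and CPU.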

3. Kata Containers: Strengthening GPU Multi-Tenancy

GPU multi-tenancy introduces isolation challenges, particularly in regulated or shared environments.

Kata Containers addresses this by running each pod inside a lightweight virtual machine:

  • Each workload runs in a dedicated microVM
  • GPUs are exposed via VFIO passthrough
  • Isolation is enforced at the hardware virtualization boundary

When combined with emerging hardware security capabilities, this provides a foundation for running sensitive workloads on shared GPU infrastructure with stronger isolation guarantees than standard containers.
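
In Kubernetes, a pod opts into an alternative runtime such as Kata through the `runtimeClassName` field of its spec. The Python sketch below builds such a pod manifest as a plain dictionary. Treat the specifics as assumptions: the RuntimeClass name `kata-gpu`, the image, and the helper `kata_gpu_pod` are hypothetical, and the GPU resource name (`nvidia.com/gpu` here) may differ in a passthrough setup depending on how devices are advertised in a given cluster.

```python
import json


def kata_gpu_pod(name, image, runtime_class="kata-gpu",
                 gpu_resource="nvidia.com/gpu", gpus=1):
    """Build a Pod manifest that requests a GPU and a VM-backed runtime.

    runtime_class and gpu_resource are placeholders; use the names
    that are actually installed in the target cluster.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Selects the RuntimeClass, and therefore the sandboxed runtime.
            "runtimeClassName": runtime_class,
            "restartPolicy": "Never",
            "containers": [{
                "name": "workload",
                "image": image,
                "resources": {"limits": {gpu_resource: gpus}},
            }],
        },
    }


pod = kata_gpu_pod("isolated-inference", "registry.example.com/inference:latest")
manifest = json.dumps(pod, indent=2)  # JSON is accepted wherever YAML manifests are
```

From the workload author's point of view, the only change relative to a standard GPU pod is the runtime class; the isolation boundary is supplied by the platform.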

From Upstream Capabilities to Platform Standards

While these projects introduce critical primitives, platform teams still need a way to standardize and operate them consistently across clusters.

In Rafay, this is achieved through Blueprints, versioned specifications that define cluster add-ons, policies, and configuration baselines. Blueprints act as the mechanism for turning upstream components into repeatable platform standards across GPU-enabled environments.

A GPU platform blueprint typically includes:

  • NVIDIA GPU Operator – driver lifecycle and GPU component management
  • KAI Scheduler – deployed as a managed add-on
  • Prometheus/NVIDIA DCGM Exporter – GPU observability
  • OPA Gatekeeper – policy enforcement
  • Network policies – namespace isolation
  • Kata Containers runtime – for stronger workload isolation

Blueprints are versioned and continuously reconciled, allowing platform teams to:

  • Enforce consistent configuration across clusters
  • Detect and remediate configuration drift
  • Support cluster-specific variations (e.g., MIG vs time-slicing) without duplicating definitions
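
The combination of a shared baseline, per-cluster overrides, and drift detection can be sketched generically. The Python below is only an illustration of the pattern and does not use or represent Rafay's API: the dictionary layout, the add-on keys, and the `deep_merge` and `find_drift` helpers are all hypothetical.

```python
import copy


def deep_merge(base, override):
    """Return base with override applied recursively; inputs are not modified."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def find_drift(desired, actual, path=""):
    """List (path, desired_value, actual_value) for every differing setting."""
    drift = []
    for key in sorted(set(desired) | set(actual)):
        here = f"{path}.{key}" if path else key
        want, have = desired.get(key), actual.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            drift.extend(find_drift(want, have, here))
        elif want != have:
            drift.append((here, want, have))
    return drift


# One shared baseline (hypothetical layout).
base = {
    "version": "v3",
    "addons": {
        "gpu-operator": {"sharing": "none"},
        "kai-scheduler": {"enabled": True},
    },
}

# A cluster-specific variation: this cluster uses MIG instead of the default.
desired = deep_merge(base, {"addons": {"gpu-operator": {"sharing": "mig"}}})

# Simulate someone disabling the scheduler by hand on the live cluster.
actual = deep_merge(desired, {"addons": {"kai-scheduler": {"enabled": False}}})

drift = find_drift(desired, actual)
```

The override carries only the setting that differs, so the baseline is defined once, and the drift report names exactly which setting a reconciler would need to restore.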

This approach enables organizations to manage heterogeneous GPU fleets while maintaining a consistent operational model.

Key Takeaways

The developments presented at KubeCon EU 2026 reflect a broader shift in GPU infrastructure within Kubernetes:

  • From node-local, opaque resources to scheduler-visible, attribute-rich resources
  • From ad hoc scheduling and sharing to structured, policy-aware allocation models
  • From best-effort isolation to stronger multi-tenant and security boundaries

For platform teams, the challenge is no longer just provisioning GPUs but operationalizing them as governed infrastructure, spanning allocation, scheduling, and isolation across the Kubernetes control plane.
