The Kubernetes Current Blog

Experience What Composable AI Infrastructure Actually Looks Like — In Just Two Hours

The pressure to deliver on the promise of AI has never been greater. Enterprises must find ways to make effective use of their GPU infrastructure to meet the demands of AI/ML workloads and accelerate time-to-market. Yet, despite making significant investments in GPU infrastructure, many organizations still struggle with low utilization, manual provisioning workflows, and limited access for the teams who need it most. 

The Rafay GPU Cloud Workshop gives you the opportunity to experience how AI workloads should be delivered as a service on top of existing infrastructure. With minimal effort on your part, you will receive a working environment preloaded with real-world AI workloads such as Jupyter notebooks, training pipelines, inference endpoints, and fine-tuning workflows.

Why Building a GPU Cloud Is So Difficult

The complexity of GPU infrastructure stems from the difficulty of bridging the gap between infrastructure and AI workflows.

Unlike traditional environments, GPU clouds need to handle a unique set of demands, some of which are listed below: 

  • There is no native way to enforce multi-tenant policies such as quotas, TTLs, or VPC isolation, even though most GPU clouds serve multiple teams or business units.
  • AI workloads don’t just run in Kubernetes. They span SLURM, distributed training frameworks, and interactive notebooks, each with its own scheduling and networking requirements.
  • End-user self-service is essential. Without it, every ML experiment becomes a ticket, creating delays and frustration for developers and data scientists.
  • Costs are opaque. Most teams can’t tie GPU usage back to users, models, or workloads, making it impossible to control budgets or optimize ROI.

These are just a few of the foundational gaps and they only scratch the surface of what is required to launch a GPU cloud at scale.

Combating GPU Cloud Orchestration Challenges with a Workshop

The Rafay GPU Cloud Workshop is a low-friction, guided experience. Users simply provide a server with two GPUs. From there, Rafay will configure a fully operational environment on their behalf, including example AI applications and workloads. A short walkthrough follows, allowing internal stakeholders to see the environment in action and explore it at their own pace.

For many early participants, the workshop has served as the catalyst to rethink how GPU infrastructure is operated and consumed. It has helped teams gain clarity on what a production-ready AI delivery model looks like, without requiring deep investment of time or people.

In just under 2 hours, attendees will:

  • Deploy a multi-tenant GPU cloud environment using industry best practices
  • Create end-user SKUs for GPU-based compute (SLURM, Kubernetes clusters, VMs, etc.) and AI/GenAI applications (notebooks, training clusters, inference, etc.) that end users can consume via a self-service experience
  • Track and report usage metrics programmatically to generate billing/cost reports for users
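As a simple illustration of the last point, per-user or per-team chargeback ultimately reduces to aggregating GPU-hours against a rate card. The sketch below is hypothetical and not Rafay's actual API: the record format, team names, and hourly rate are all assumed for illustration.

```python
from collections import defaultdict

# Hypothetical flat rate per GPU-hour; real rates vary by GPU model and provider.
RATE_PER_GPU_HOUR = 2.50

# Example usage records as an orchestration platform might export them:
# (team, workload, gpu_count, hours). All values are illustrative.
usage_records = [
    ("ml-research", "notebook", 1, 6.0),
    ("ml-research", "training", 4, 12.0),
    ("platform", "inference", 2, 24.0),
]

def build_cost_report(records, rate):
    """Aggregate GPU-hours per team and compute chargeback totals."""
    report = defaultdict(lambda: {"gpu_hours": 0.0, "cost": 0.0})
    for team, _workload, gpus, hours in records:
        gpu_hours = gpus * hours
        report[team]["gpu_hours"] += gpu_hours
        report[team]["cost"] += gpu_hours * rate
    return dict(report)

report = build_cost_report(usage_records, RATE_PER_GPU_HOUR)
for team, totals in sorted(report.items()):
    print(f"{team}: {totals['gpu_hours']:.1f} GPU-hours, ${totals['cost']:.2f}")
```

A production system would pull these records from the platform's metering API and apply per-SKU rates, but the aggregation logic looks much the same.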

It will be a working session tailored to attendees’ infrastructure goals, led by Rafay experts who have operationalized GPU environments for enterprise-scale workloads.

Why This Workshop Matters

Here’s the hard truth: if organizations are not providing self-service, secure, metered access to GPUs, AI infrastructure practitioners become ineffective. Meanwhile, the most expensive hardware in any stack sits underutilized, accumulating interest faster than anyone would like.

The Rafay GPU Cloud Workshop is a great first step for infrastructure leaders and teams looking to grow their AI infrastructure “maturity” without the time or people investments a traditional POC would require. Because the workshop also requires no monetary investment, it offers teams a faster, more concrete way to deliver secure, governed, and production-grade GPU access to those who need it most.

Learn more about Rafay’s approach in this NVIDIA blog: “How Rafay’s Self-Service Platform Delivers NVIDIA Accelerated Computing for Enterprise AI Workloads,” and read about how Rafay is democratizing access in Rafay’s latest blog: “Democratizing GPU Access: How PaaS Self-Service Workflows Transform AI Development.”
