
Introducing Serverless Inference: Team Rafay’s Latest Innovation

The GenAI revolution is in full swing, and for NVIDIA Cloud Partners (NCPs), GPU Cloud Providers (aka GPU Clouds), and Sovereign Cloud operators, it presents a significant opportunity. To keep up with market demand, NCPs and GPU Clouds want to go beyond simply selling GPUs: they are looking for ways to deliver managed GenAI services that help customers speed up AI adoption while boosting their own operating margins.

However, building and managing a large-scale GenAI services platform is a massive undertaking that requires large R&D teams, deep expertise in AI use cases, and a highly specialized platform engineering team that can operate a complex, multi-tenant compute infrastructure. Without the right foundation (pun intended), scaling these services profitably and reliably becomes a major challenge.

That’s where Rafay comes in.

The Rafay Platform is designed for NCPs and GPU Clouds to deliver GenAI services to their end customers as part of a multi-tenant, Platform-as-a-Service experience, complete with self-service consumption of compute and AI applications by developers. As highlighted in a recent press release, NCPs and GPU Clouds can now deliver Serverless Inference as a turnkey service at no additional cost to their customer base. This enables developers to build and scale AI applications fast, without taking on the cost and complexity of building the automation, governance, and controls that are essential for GPU-based infrastructure.

The global AI inference market is on a fast growth trajectory, expected to surge from $106.15 billion in 2025 to $254.98 billion by 2030, a robust CAGR of 19.2%, according to MarketsandMarkets™. Any NCP or GPU Cloud that partners with Rafay to deliver Serverless Inference in its region will immediately benefit from the growing demand for inference-focused offerings and will improve its top line by delivering a high-value service to its customer base.


What is Rafay’s Serverless Inference offering?

Developers worldwide love the simplicity and ease of use of Amazon Bedrock™, but they want to consume such a service in their own region, within their country’s sovereign borders. Rafay empowers NCPs and GPU Clouds to offer their own “Amazon Bedrock™-like” service, giving developers instant access to a wide range of high-performing foundation models (FMs) through a unified, OpenAI-compatible API. Models are consumed on demand, and all usage is metered by tokens and time.
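Because the service exposes an OpenAI-compatible API, the standard openai Python SDK works as-is once it is pointed at the provider’s endpoint. The sketch below is illustrative only: the base URL, model ID, and environment variable are hypothetical placeholders, not actual Rafay or provider values.

```python
# A minimal sketch of calling an OpenAI-compatible serverless endpoint.
# The base URL, model ID, and environment variable are hypothetical
# placeholders; substitute the values issued by your provider.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example-gpu-cloud.com/v1",  # hypothetical endpoint
    api_key=os.environ["INFERENCE_API_KEY"],                # bearer token from the provider
)

response = client.chat.completions.create(
    model="llama-3.2-8b-instruct",  # hypothetical model ID from the provider's catalog
    messages=[{"role": "user", "content": "Summarize the benefits of serverless inference."}],
)
print(response.choices[0].message.content)
```

Applications already written against the OpenAI API only need the base_url and api_key changed, which is what makes migration a zero-code-change exercise.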

Whether delivered as a multi-tenant service for thousands of users or as dedicated endpoints for enterprises that demand privacy, Rafay’s Serverless Inference offering is a new monetization engine for NCPs and GPU Clouds.


Rafay Serverless Inference: What’s In It For Developers?

Rafay Serverless Inference transforms how NCPs and GPU Clouds deliver inference-as-a-service. With OpenAI-compatible APIs, SLA-backed performance, and intuitive operational workflows, Serverless Inference is built to simplify GenAI adoption at scale. Serverless Inference delivers the on-demand consumption experience that developers and data scientists are looking for, while giving enterprises the comfort that their workloads and data are confined within their country’s sovereign borders.

With Serverless Inference, developers enjoy:

Instant Access – Start using powerful open-source LLMs immediately through an OpenAI-compatible API—no setup required.

Zero Code Migration – Applications already integrated with OpenAI APIs require no code changes.

No Infrastructure Management – Get access to scalable inference without provisioning or managing hardware.

Wide Model Choice – Select from a growing catalog of high-performing models, including Llama 3.2, Qwen, and DeepSeek.

Flexible Consumption Models – Use shared multi-tenant endpoints or request dedicated instances with enterprise-grade SLAs.

Token-Based Licensing – Pay only for what you use, based on input/output token usage (see the usage sketch after this list).

Real-Time Visibility – Monitor usage and costs in real time, with access to historical trends.

Security and Control – All access is encrypted (HTTPS) and authenticated (bearer tokens), with audit logs and token revocation available.
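Because licensing is token-based, the input/output token counts it depends on can be read from each response’s usage block, assuming the endpoint follows the standard OpenAI response schema. A minimal sketch, reusing the hypothetical client and model ID from the earlier example:

```python
# Reading per-request token usage from an OpenAI-compatible response;
# these are the counts that token-based billing is driven by.
response = client.chat.completions.create(
    model="llama-3.2-8b-instruct",  # hypothetical model ID
    messages=[{"role": "user", "content": "Classify this ticket: 'login page times out'."}],
)

usage = response.usage  # standard OpenAI-style usage block
print(f"input tokens:  {usage.prompt_tokens}")
print(f"output tokens: {usage.completion_tokens}")
print(f"total tokens:  {usage.total_tokens}")
```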

These capabilities empower developers and data scientists to integrate GenAI into their applications with maximum velocity, while giving NCPs and GPU Clouds the confidence to scale with transparency, governance, and control. For the platform, engineering, and infrastructure teams at NCPs and GPU Clouds, the Rafay Platform simplifies and accelerates the delivery of these GenAI capabilities.


Rafay Serverless Inference: What’s In It For NCPs & GPU Clouds?

Seamless Integration for Developers

  • OpenAI-compatible APIs
  • Zero infrastructure provisioning
  • Secure, RESTful, streaming-ready endpoints (see the streaming sketch below)
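As a sketch of what “streaming-ready” means for an OpenAI-compatible endpoint, tokens can be consumed incrementally rather than waiting for the full completion. The client and model ID are the same hypothetical placeholders used earlier:

```python
# Streaming tokens from an OpenAI-compatible endpoint; chunks arrive
# incrementally instead of as one final response.
stream = client.chat.completions.create(
    model="llama-3.2-8b-instruct",  # hypothetical model ID
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    # Some chunks (e.g., the final one) may carry no content delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```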

Intelligent Infrastructure Management

  • Auto-scaling GPU nodes
  • Right-sized GPU allocation per model
  • Multi-tenant and dedicated isolation

Built-in Metering & Billing Integration

  • Token-based usage tracking (input/output)
  • Cost metering APIs for billing platforms
  • Historical and real-time usage dashboards

Enterprise-Ready Security & Governance

  • HTTPS-only API endpoints
  • Bearer token authentication with rotation
  • Access logging and audit trails
  • Token quotas per team/business unit/app

Observability, Storage & Performance Monitoring

  • Logs and metrics archived in the provider’s storage namespace
  • Support for backends like MinIO, Weka, and more
  • Centralized credential management


PaaS: Why Build It Yourself When You Don’t Have To?

The hard truth about infrastructure management and orchestration is that it all looks easier than it really is. Some engineers on your team may be thinking, “All we have to do is spin up vLLM, maybe integrate TensorRT-LLM, plug in our models, and we’re off to the races.”

But here’s what that path actually looks like.

First, you’ll need to assemble a team of specialists—people who deeply understand GPUs, inference tuning, and distributed systems. Then comes the long, intricate process of optimizing performance: adjusting model parameters, minimizing latency, and ensuring token throughput doesn’t choke under production loads.

After that? You’ll have to build your own scaling logic. Create dashboards. Add token metering. Set up secure access controls. Archive logs. Monitor everything. And then maintain it—all day, every day, under pressure to meet SLAs your business depends on.

The reality is, DIY platforms turn into a huge time and money sink—not a differentiator.

Now, imagine skipping all that and launching services in weeks with tiny engineering teams. That’s what Rafay brings to the table.

