The Kubernetes Current Blog

How Rafay Helps Sovereign & GPU Cloud Companies Accelerate Time to Market

The Generative AI (GenAI) gold rush is in full swing, and a new use case is fast emerging globally: Sovereign Clouds for AI workloads, a.k.a. GPU Clouds.

Why are GPU Clouds being born? It’s the data.

The most curated and abundant dataset will produce the best-trained model, and the best-trained model will win. Many countries therefore view the prospect of their data leaving their sovereign borders as a national security issue, and are investing in building GPU Clouds in-country to accelerate the development of AI applications.

Before this market settles, it’s certainly not inconceivable that hundreds – if not thousands – of GPU Clouds of varying computing capacity will emerge globally.

If you’re considering building a GPU Cloud of your own and have the monetary resources to do so, here are three key requirements you need to consider:

  1. Datacenter Essentials
    You need to invest in abundant power, cooling, connectivity, etc., to ensure that you can operate your GPU Cloud 24/7.
  2. GPUs
You need a steady supply of GPUs to keep adding capacity for your future customers. Given serious supply constraints globally, an "if you build it, they will come" strategy is currently working well for providers across the GPU market. There are additional critical requirements, such as storage and networking, to ensure that your clusters don't hit I/O bottlenecks that slow down the pace of model training.
  3. Platform-as-a-Service Layer for Accelerated Computing
    You need to build a multi-tenant, user-friendly system that allows developers and enterprise IT teams to reserve dedicated or shared resources where they can carry out their AI-related tasks. Making this layer work securely with guardrails and strong isolation is a massive project.
    To provide this layer, you’ll need to think about:

    • A self-service portal for customers to sign up and consume GPU-driven experiences on demand.
    • On-demand resource provisioning capabilities to scale up existing – or spin up new – Kubernetes clusters based on demand.
    • GPU virtualization capabilities to serve the large cohort of users who don't need a full GPU for their jobs and can make do with a fraction of one — an eighth of a GPU, for example.
    • Resource match-making capabilities to map new users into compute buckets such that there is minimal fragmentation of resources across your compute fleet.
    • A SageMaker-like experience so that developers can spin up Jupyter notebooks for model creation, then package their models as containers to deploy on Kubernetes clusters.
    • Managed services such as databases, pub-sub queues, etc., that users may need to operate their apps in this cloud.
    • An operations portal to manage all of the above via a single pane of glass.
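To make the match-making requirement concrete, here is a minimal sketch of one common approach — best-fit-decreasing bin packing of fractional-GPU requests onto nodes. This is an illustration only, not Rafay's actual implementation; all names (`Node`, `place_requests`) and the data are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Node:
    """A GPU node with some amount of free fractional-GPU capacity."""
    name: str
    free_gpus: float  # e.g. 2.0 = two full GPUs free

def place_requests(requests: dict[str, float], nodes: list[Node]) -> dict[str, str]:
    """Best-fit-decreasing: place the largest requests first, each onto the
    node with the least (but sufficient) free capacity, so that small
    fractional requests (e.g. 1/8 GPU) fill the remaining gaps and
    fragmentation across the fleet stays low."""
    placement: dict[str, str] = {}
    for user, need in sorted(requests.items(), key=lambda kv: -kv[1]):
        # Try the tightest-fitting node first.
        for node in sorted(nodes, key=lambda n: n.free_gpus):
            if node.free_gpus + 1e-9 >= need:
                node.free_gpus -= need
                placement[user] = node.name
                break
        else:
            # In a real PaaS this would trigger on-demand cluster scale-up.
            placement[user] = "UNSCHEDULED"
    return placement

nodes = [Node("gpu-node-a", 1.25), Node("gpu-node-b", 2.0)]
requests = {"alice": 1.0, "bob": 0.125, "carol": 0.125, "dave": 2.0}
print(place_requests(requests, nodes))
```

In this toy run, the 2-GPU request lands on the 2-GPU node, and the 1-GPU and two 1/8-GPU requests pack tightly onto the other node instead of stranding capacity. A production scheduler must also weigh topology, isolation, and tenant quotas, but the fragmentation-minimizing intuition is the same.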

Rafay’s Platform as a Service solution for Accelerated Computing can help with all of the above requirements.

Rafay delivers the most comprehensive Platform-as-a-Service offering for Kubernetes-based compute on the market. With GenAI workloads primarily being run on Kubernetes clusters, all GPU Cloud providers need to deliver Kubernetes as a service to users, while also solving for user management, workload isolation, access control, policy enforcement, etc. Rafay customers are able to launch a fully functioning PaaS offering 4x faster than in-house development efforts and reduce TCO by an average of 66%.

If you’re building a GPU Cloud, don’t waste time and resources building a PaaS layer. Partner with Rafay and use the savings to buy more GPUs.

Reach out to us at [email protected] to learn more.
