Part 2: Self-Service Fractional GPU Memory with Rafay GPU PaaS

In Part-1, we explored how Rafay GPU PaaS empowers developers to use fractional GPUs, allowing multiple workloads to share GPU compute efficiently. This enabled better utilization and cost control — without compromising isolation or performance.
In Part-2, we show how you can enhance this by providing users the means to select fractional GPU memory. While fractional GPUs provide a share of the GPU's compute cores, different workloads have dramatically different GPU memory needs. With this update, developers can choose exactly how much GPU memory they want for their pods, bringing fine-grained control, better scheduling, and cost efficiency.

Traditionally, fractional GPUs divide a GPU into slices (e.g., ¼ or ½), assuming proportional memory distribution. However, real-world workloads don't always scale linearly with GPU memory or compute.
By letting users explicitly select GPU memory, Rafay GPU PaaS helps decouple memory allocation from compute fraction, ensuring that each workload gets exactly what it needs — no more, no less.
Without fractional memory selection, administrators often over-allocate GPU memory simply to avoid out-of-memory (OOM) errors. For example, an NVIDIA H100 GPU has ~80GB of memory, so a 25% GPU fraction implies 20GB of memory, which is far more than many workloads need.
This leads to wasted GPU memory and stranded capacity. Allowing memory to be specified directly means each workload reserves only what it actually uses, as the sketch below illustrates.
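To make the decoupling concrete, here is a minimal sketch of a pod manifest that requests a 25% compute fraction but only 5GB of GPU memory, rather than the proportional 20GB on an 80GB H100. The annotation keys (gpu-fraction, gpu-memory) and the image name are assumptions modeled on common fractional-GPU schedulers, not Rafay's documented API; Rafay GPU PaaS surfaces this choice through its self-service console.

```python
# Sketch: a pod spec that decouples GPU memory from compute fraction.
# NOTE: the annotation keys are hypothetical, modeled on common
# fractional-GPU schedulers; they are not Rafay's documented API.
import yaml  # pip install pyyaml

pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "inference-small",
        "annotations": {
            # 25% of the GPU's compute cores...
            "gpu-fraction": "0.25",
            # ...but only 5GB of its memory (in MiB), instead of the
            # proportional 20GB on an 80GB H100.
            "gpu-memory": "5120",
        },
    },
    "spec": {
        "containers": [
            {
                "name": "inference",
                "image": "my-inference-image:latest",  # placeholder image
                "resources": {"limits": {"cpu": "2", "memory": "8Gi"}},
            }
        ]
    },
}

print(yaml.safe_dump(pod_manifest, sort_keys=False))
```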
Developers can now see the impact of memory choices in real time: Rafay updates the cost estimate dynamically based on the selected CPU, memory, and GPU memory fraction. For example, selecting a 2GB GPU fraction results in a cost of about $0.20/hour, scaling up predictably for larger allocations.
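To show how such an estimate scales, here is a minimal estimator sketch. Only the $0.10 per GB-hour GPU memory rate is backed out from the $0.20/hour figure quoted above for a 2GB fraction; the CPU and system-memory rates are hypothetical placeholders, and actual pricing is whatever the platform administrator configures.

```python
# Minimal cost-estimator sketch. Rates are illustrative: the GPU memory
# rate is derived from the $0.20/hr figure for a 2GB fraction quoted
# above; the CPU and system-memory rates are hypothetical.
GPU_MEM_RATE = 0.10   # $/GB-hour of GPU memory, assumed linear
CPU_RATE = 0.02       # $/vCPU-hour (hypothetical)
SYS_MEM_RATE = 0.005  # $/GiB-hour of system RAM (hypothetical)

def hourly_cost(gpu_mem_gb: float, vcpus: float, sys_mem_gib: float) -> float:
    """Estimate the hourly cost of a pod from its resource selections."""
    return (gpu_mem_gb * GPU_MEM_RATE
            + vcpus * CPU_RATE
            + sys_mem_gib * SYS_MEM_RATE)

# A 2GB GPU fraction alone costs 2 * 0.10 = $0.20/hour, matching the example.
print(f"${hourly_cost(2, 0, 0):.2f}/hour")   # $0.20/hour
print(f"${hourly_cost(8, 2, 16):.2f}/hour")  # larger allocations scale linearly
```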

Transparency helps developers optimize both budget and performance before deployment.
The developer experience remains as seamless as ever: Rafay takes care of provisioning, isolation, scheduling, and lifecycle management, letting developers focus on building models and running workloads rather than tuning infrastructure.

In Part-1, we introduced fractional GPU compute, where users allocate and consume a portion of a GPU's cores rather than an entire physical GPU. In Part-2, we expanded this capability with fractional GPU memory selection, giving developers precise control over how much GPU memory their workloads consume, independent of compute fraction.
In Part-3, we will show how you can enhance the self-service experience by giving users the option to pay more for priority access to shared GPU resources.
