Infra operators managing GPU-enabled Kubernetes clusters often need a fast and secure way to validate GPU visibility, driver health, and runtime readiness without exposing the cluster directly or relying on bastion hosts, VPNs, or manually managed kubeconfigs.
With Rafay's zero trust kubectl, operators can securely access remote Kubernetes resources and execute commands inside running pods from the Rafay platform. A simple but powerful example is running nvidia-smi inside a GPU Operator pod to confirm that the NVIDIA driver stack, CUDA runtime, and GPU devices are functioning correctly on a remote cluster.
In this post, we walk through how infra operators can use Rafay's zero trust access workflow to run nvidia-smi on a remote GPU-based Kubernetes cluster.

GPU validation is a routine part of day-2 operations for platform teams. It is especially useful after:
- Provisioning new GPU nodes or clusters
- Upgrading the NVIDIA driver stack or the GPU Operator
- Investigating workloads that cannot detect GPUs
Rather than giving operators broad direct access to every cluster, Rafay provides a centralized and secure access path for common operational tasks like these.
In this workflow, you will:
- Navigate to the gpu-operator-resources namespace
- Locate the nvidia-dcgm-exporter pod
- Exec into it and run nvidia-smi

This gives you a quick way to confirm GPU visibility and validate that the NVIDIA software stack is functioning correctly on the remote cluster.
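For comparison, the same validation can be sketched in plain kubectl terms. This is a hypothetical sketch, not the Rafay implementation: the pod-name suffix is made up, and the command is assembled rather than executed because it requires a live cluster.

```shell
# What the Exec workflow amounts to in raw kubectl terms. Rafay's zero
# trust kubectl achieves the equivalent without distributing kubeconfigs
# or exposing the API server. The pod suffix "abc12" is illustrative.
NS=gpu-operator-resources
POD=nvidia-dcgm-exporter-abc12
CMD="kubectl exec -n $NS $POD -- nvidia-smi"
echo "$CMD"
```

With direct access, running the assembled command drops you into the same check the UI workflow performs.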
Start by logging in to your Rafay organization.
This opens the cluster detail page, where you can inspect nodes, resources, and cluster operational state.
Inside the cluster page, click the Resources tab.
In the resource browser, select the gpu-operator-resources namespace. This namespace contains resources deployed by the NVIDIA GPU Operator.
Next, click Pods under the deployment resources section. You should see the GPU Operator-related pods for the cluster.

From the pod list, locate the pod named similar to nvidia-dcgm-exporter-<suffix> and make sure it is in a Running state. This pod is part of the NVIDIA GPU Operator stack and is a useful location for verifying GPU visibility from within the cluster runtime environment.
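In automation, the same lookup can be scripted. A minimal sketch, filtering an inline sample listing that stands in for live `kubectl get pods -n gpu-operator-resources` output (pod names and ages are illustrative):

```shell
# Filter a pod listing for a Running nvidia-dcgm-exporter pod. The sample
# lines stand in for real `kubectl get pods` output.
sample='nvidia-dcgm-exporter-abc12      1/1   Running   0   3d
nvidia-device-plugin-ds-xyz89   1/1   Running   0   3d'
echo "$sample" | awk '/^nvidia-dcgm-exporter/ && /Running/ {print $1}'
```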
Click the Actions menu alongside the nvidia-dcgm-exporter pod. From the available options, select Exec.
Rafay will open a remote shell session into the container using its zero trust kubectl access path. This allows operators to securely interact with the running workload without requiring direct cluster exposure.
Once the shell is open, run:

nvidia-smi

This command queries the NVIDIA driver stack and returns details about the GPU devices visible to the container.
A successful nvidia-smi response typically shows:
- The installed driver version and CUDA version
- The GPU model, memory usage, and utilization
- Any processes currently using the GPUs

Output like this confirms that the NVIDIA driver stack, CUDA runtime, and GPU devices are all functioning correctly. It is often the quickest way to validate that the node and runtime are ready for GPU-backed workloads.
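For scripted checks, nvidia-smi also offers machine-readable output via its --query-gpu flags. A small sketch, parsing a sample line that stands in for real output (the GPU model and driver version shown are illustrative, not from this post):

```shell
# Against a GPU node you would run:
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
# The sample below stands in for that output; values are illustrative.
sample='NVIDIA A100-SXM4-40GB, 550.54.15, 40960 MiB'
driver=$(echo "$sample" | awk -F', ' '{print $2}')
echo "driver=$driver"
```

The CSV form is easier to assert on in health checks than the default human-readable table.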
Why use zero trust kubectl for this?

This workflow highlights several practical benefits for infra and platform teams.
Operators can run troubleshooting and validation commands without exposing the Kubernetes API publicly or distributing long-lived cluster credentials broadly.
When a workload cannot detect GPUs, operators can quickly inspect GPU-related pods and verify the environment from inside the cluster.
Teams can use the same operational pattern across clusters and environments, which simplifies day-2 management.
Before onboarding AI or ML workloads, platform teams can confirm that GPU nodes are correctly configured and visible to the Kubernetes runtime.
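A complementary pre-onboarding check is confirming that GPU nodes actually advertise nvidia.com/gpu capacity to the scheduler. The jsonpath pattern below is standard kubectl (the dot in the resource name must be escaped); the command is assembled as a string here rather than executed, since it needs a live cluster:

```shell
# List each node alongside its advertised GPU capacity. Assembled for
# illustration; run the echoed command against a real cluster.
QUERY='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'
CMD="kubectl get nodes -o jsonpath=$QUERY"
echo "$CMD"
```

If a GPU node shows no nvidia.com/gpu capacity, the device plugin or driver stack is a likely culprit, and the nvidia-smi check above helps narrow it down.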
For infra operators, validating GPU readiness should be fast, secure, and repeatable.
With Rafay zero trust kubectl, operators can open a shell into a remote Kubernetes pod and run a simple command like:
nvidia-smi
This workflow is useful for scenarios such as validating newly provisioned GPU nodes, confirming driver and GPU Operator health after upgrades, and troubleshooting workloads that cannot detect GPUs.
