Run nvidia-smi on Remote GPU Kubernetes Clusters Using Rafay Zero Trust Access

March 9, 2026

Infra operators managing GPU-enabled Kubernetes clusters often need a fast and secure way to validate GPU visibility, driver health, and runtime readiness without exposing the cluster directly or relying on bastion hosts, VPNs, or manually managed kubeconfigs.

With Rafay's zero trust kubectl, operators can securely access remote Kubernetes resources and execute commands inside running pods from the Rafay platform. A simple but powerful example is running nvidia-smi inside a GPU Operator pod to confirm that the NVIDIA driver stack, CUDA runtime, and GPU devices are functioning correctly on a remote cluster.

In this post, we walk through how infra operators can use Rafay's zero trust access workflow to run nvidia-smi on a remote GPU-based Kubernetes cluster.


Why this workflow matters

GPU validation is a routine part of day-2 operations for platform teams. It is especially useful after:

  • provisioning a new GPU-backed cluster
  • deploying the NVIDIA GPU Operator
  • adding new GPU nodes
  • upgrading drivers or container runtime components
  • troubleshooting workloads that are not detecting GPUs

Rather than giving operators broad direct access to every cluster, Rafay provides a centralized and secure access path for common operational tasks like these.


What you will do

In this workflow, you will:

  1. Log in to your Rafay organization
  2. Open the Infrastructure Portal
  3. Navigate to the target Kubernetes cluster
  4. Open the cluster's Kubernetes resources
  5. Select the gpu-operator-resources namespace
  6. Open an exec session into the nvidia-dcgm-exporter pod
  7. Run nvidia-smi

This gives you a quick way to confirm GPU visibility and validate that the NVIDIA software stack is functioning correctly on the remote cluster.

Step 1: Log in to Rafay and open the target cluster

Start by logging in to your Rafay organization.

  • From the left navigation, go to Infrastructure.
  • Select the target GPU-enabled Kubernetes cluster card.

This opens the cluster detail page, where you can inspect nodes, resources, and cluster operational state.

Step 2: Navigate to Kubernetes resources

Inside the cluster page, click the Resources tab.

In the resource browser:

  • Set Show Resources By to Namespaces
  • Select the namespace gpu-operator-resources

This namespace contains resources deployed by the NVIDIA GPU Operator.

Next, click Pods under the deployment resources section. You should see the GPU Operator-related pods for the cluster.

Step 3: Locate the nvidia-dcgm-exporter pod

From the pod list, locate the pod whose name looks like:

nvidia-dcgm-exporter-<suffix>

Make sure the pod is in a Running state. This pod is part of the NVIDIA GPU Operator stack and is a useful location for verifying GPU visibility from within the cluster runtime environment.
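If you also have kubectl access to the cluster (for example, through a Rafay-provided kubeconfig), the same check can be sketched from a terminal. The namespace comes from this walkthrough; the pod name suffix will differ per cluster:

```shell
# List GPU Operator pods and confirm the DCGM exporter is Running
kubectl get pods -n gpu-operator-resources | grep nvidia-dcgm-exporter
```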

Step 4: Open an exec session

Click the Actions menu alongside the nvidia-dcgm-exporter pod. From the available options, select Exec.

Rafay will open a remote shell session into the container using its zero trust kubectl access path. This allows operators to securely interact with the running workload without requiring direct cluster exposure.
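For reference, the equivalent action from a kubectl session would look like the following sketch. The pod name here is illustrative; substitute the actual name from your cluster:

```shell
# Open an interactive shell inside the DCGM exporter container
kubectl exec -it nvidia-dcgm-exporter-abc12 \
  -n gpu-operator-resources -- /bin/bash
```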

Step 5: Run nvidia-smi

Once the shell is open, run:

nvidia-smi

This command queries the NVIDIA driver stack and returns details about the GPU devices visible to the container.

Deciphering the output

A successful nvidia-smi response typically shows:

  • NVIDIA driver version
  • CUDA version
  • detected GPU model
  • temperature and power state
  • memory usage
  • GPU utilization
  • active GPU processes, if any

In the example output from this walkthrough, nvidia-smi confirms:

  • the cluster node can see an NVIDIA T1000 8GB GPU
  • the NVIDIA driver is installed and working
  • the CUDA stack is available
  • memory and utilization metrics are being reported correctly

This is often the quickest way to validate that the node and runtime are ready for GPU-backed workloads.
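For scripted checks, nvidia-smi also supports machine-readable output through its documented --query-gpu and --format flags. Below is a minimal parsing sketch; the captured line mirrors this walkthrough's T1000 example, but the driver version and metric values are made up for illustration:

```shell
# On a GPU node, CSV output can be requested with:
#   nvidia-smi --query-gpu=name,driver_version,memory.used,utilization.gpu --format=csv,noheader
# Here we parse one captured line of that output (values are illustrative):
line='NVIDIA T1000 8GB, 535.104.05, 210 MiB, 3 %'
gpu_name=$(printf '%s' "$line" | awk -F', ' '{print $1}')
driver=$(printf '%s' "$line" | awk -F', ' '{print $2}')
echo "GPU: $gpu_name (driver $driver)"
```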

Why use Rafay zero trust kubectl for this

This workflow highlights several practical benefits for infra and platform teams.

1. Secure operational access

Operators can run troubleshooting and validation commands without exposing the Kubernetes API publicly or distributing long-lived cluster credentials broadly.

2. Faster GPU troubleshooting

When a workload cannot detect GPUs, operators can quickly inspect GPU-related pods and verify the environment from inside the cluster.

3. Consistent multi-cluster operations

Teams can use the same operational pattern across clusters and environments, which simplifies day-2 management.

4. Better readiness validation

Before onboarding AI or ML workloads, platform teams can confirm that GPU nodes are correctly configured and visible to the Kubernetes runtime.
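One way to confirm node-level readiness from the command line, assuming the NVIDIA device plugin has registered GPUs with the kubelet and jq is installed locally, is to check the nvidia.com/gpu capacity each node advertises:

```shell
# List each node with the number of GPUs it advertises to the scheduler
kubectl get nodes -o json \
  | jq -r '.items[] | "\(.metadata.name): \(.status.capacity["nvidia.com/gpu"] // "0") GPU(s)"'
```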


Conclusion

For infra operators, validating GPU readiness should be fast, secure, and repeatable.

With Rafay zero trust kubectl, operators can open a shell into a remote Kubernetes pod and run a simple command like:

nvidia-smi

This workflow is useful for scenarios such as:

  • post-deployment GPU validation
  • troubleshooting GPU visibility issues
  • validating NVIDIA GPU Operator health
  • checking readiness before AI/ML onboarding
  • confirming node-level GPU access after upgrades or maintenance