Technical Support Engineer - Remote
About Rafay Systems
Rafay is redefining how enterprises and GPU cloud providers deploy, manage, and scale modern applications. Our platform delivers self-service workflows, multi-cluster orchestration, and end-to-end lifecycle management for Kubernetes and cloud-native infrastructure, empowering platform teams to operate with speed, security, and efficiency at scale. As we grow, we’re looking for a Technical Support Engineer who thrives on solving complex distributed systems problems and is passionate about delivering a world-class support experience.
Role Summary
This is a deeply technical, hands-on support role focused on diagnosing and resolving real-world production issues in Kubernetes environments. This is not a ticket triage role—you’ll be expected to own problems end-to-end.
You’ll work directly with enterprise customers running mission-critical workloads, acting as a technical escalation point across Kubernetes control planes, cluster lifecycle operations, networking, and cloud infrastructure. You’ll collaborate closely with Engineering and SRE teams to debug issues, identify root causes, and drive resolution rather than just work around symptoms.
This role offers a unique opportunity to work at the cutting edge of Kubernetes, cloud infrastructure, and AI/ML platform management, while collaborating with our Customer Success and Engineering teams to ensure successful customer outcomes.
Key Responsibilities
· Own and resolve advanced technical support cases involving multi-cluster Kubernetes deployments, cluster provisioning failures, and workload runtime issues across public/private clouds
· Perform deep troubleshooting using tools like kubectl, cluster logs, events, and metrics to diagnose issues across control plane and data plane components
· Debug and support cluster lifecycle management workflows, including provisioning, upgrades, scaling, and recovery
· Analyze issues related to networking (CNI), ingress, DNS, service mesh, and storage (CSI) in Kubernetes environments
· Reproduce complex customer issues in internal environments and identify root cause with precision
· Act as a trusted customer advocate, proactively identifying risks and working cross-functionally to resolve them
· Collaborate with Engineering to escalate bugs, validate fixes, and improve product reliability
· Provide clear, concise, and technically accurate communication to customers during incident resolution
· Contribute to runbooks, troubleshooting guides, and knowledge base articles
· Stay up to date on Rafay platform features, releases, and cloud-native ecosystem updates
· Participate in on-call rotations to support critical customer incidents
Required Qualifications
· 5+ years of experience in Technical Support, SRE, or DevOps roles supporting production environments
· Strong hands-on experience managing and troubleshooting Kubernetes clusters in production
· Deep expertise in Kubernetes architecture, container orchestration technologies, and debugging techniques
· Proven ability to troubleshoot pod lifecycle issues, cluster networking (DNS, routing, firewalls, etc.), storage, Helm deployments, and node-level issues
· Strong understanding of cloud platforms (AWS, GCP, or Azure) and virtualization technologies (vSphere, OpenStack)
· Solid fundamentals in Linux systems, networking (TCP/IP, DNS), and distributed systems
· Experience in customer-facing roles handling escalations and high-severity incidents, including use of support tools such as Zendesk
· Excellent written and verbal communication skills
· Proven ability to work independently in fast-paced, dynamic environments
· Bachelor’s degree in computer science or a related field (or equivalent practical experience)
Preferred Qualifications
· CKA (Certified Kubernetes Administrator) or equivalent hands-on expertise
· Experience with Kubernetes ecosystem tools such as Helm, Prometheus, Grafana, and Terraform
· Familiarity with multi-cluster management and GitOps workflows
· Experience supporting enterprise SaaS platforms or developer infrastructure products
· Exposure to AI/ML infrastructure or GPU-based workloads is a plus
Why Join Rafay?
Rafay is at the forefront of cutting-edge cloud-native and GPU PaaS technologies and on a mission to modernize infrastructure for the next generation of enterprise applications—cloud-native, AI/ML-driven, and highly scalable. We offer:
· A front-row seat to foundational innovations in cloud-native and GPU PaaS technologies.
· A collaborative, fast-paced work environment with opportunities to grow and lead.
· Competitive compensation, comprehensive benefits, and attractive stock options.
· A culture focused on learning, ownership, and technical excellence.








