Solutions Architect - Toronto, Canada

Full Time

Toronto, Canada

About the Role

We are seeking a Solutions Architect to help customers successfully deploy, operate, and scale AI/ML workloads on our GPU Platform-as-a-Service (PaaS) offering. In this customer-facing role, you will work closely with platform engineering, MLOps, data science, and infrastructure teams to design and implement production-ready AI infrastructure solutions built on Kubernetes and GPU-accelerated environments.

You will help customers onboard to the platform, optimize workload performance, automate infrastructure, and ensure reliable operations while serving as a trusted technical advisor throughout the customer lifecycle.

Responsibilities

Partner with customer platform, MLOps, and data science teams to understand AI/ML workload requirements and translate them into scalable platform architectures.
Design and deploy Kubernetes-based solutions for model training, fine-tuning, and inference workloads.
Assist customers with onboarding and implementation of the GPU PaaS platform across cloud and hybrid environments.
Configure networking, identity management, observability, and security integrations with enterprise systems.
Build and maintain automation assets including Terraform modules, Helm charts, GitOps workflows, and CI/CD pipelines.
Monitor and troubleshoot production environments, including GPU utilization, workload performance, cluster health, and cost efficiency.
Support root cause analysis and remediation efforts for customer issues.
Serve as a technical advisor and day-to-day point of contact for assigned customers.
Document best practices and provide feedback to Product and Engineering teams to improve platform capabilities.
Collaborate with internal teams to ensure successful customer adoption and expansion.

Required Qualifications

4+ years of experience in Solutions Architecture, DevOps, Platform Engineering, Site Reliability Engineering (SRE), Cloud Engineering, or related fields.
Strong hands-on experience with Kubernetes in production environments.
Experience with at least one programming language such as Python or Go.
Experience with AWS, Azure, or GCP, including networking, IAM, and managed Kubernetes services.
Knowledge of Infrastructure as Code and automation tooling such as Terraform, Helm, GitOps workflows, and CI/CD platforms.
Familiarity with monitoring and observability technologies such as Prometheus, Grafana, or OpenTelemetry.
Understanding of AI/ML infrastructure concepts including GPU-based workloads, model serving, training pipelines, and resource optimization.
Strong troubleshooting, communication, and customer-facing skills.

Preferred Qualifications

Experience supporting enterprise customers in cloud-native environments.
Familiarity with AI/ML frameworks such as PyTorch and TensorFlow.
Experience with GPU scheduling, autoscaling, and workload optimization.
Understanding of multi-tenant Kubernetes environments and platform operations.
Experience working with MLOps or AI infrastructure platforms.

Why Join Rafay?

Rafay is at the forefront of GPU PaaS and Kubernetes technologies, and we offer unique opportunities to join a winning team working on foundational technology for cloud and AI/ML services and enterprises. We work in a collaborative environment that rewards creative thinking and provides opportunities to advance professional careers in advanced technology development. On top of this, we offer a fun and dynamic work environment, a competitive salary, robust benefits, and attractive stock options. As the first of our kind, we are truly in a class of our own.
