Solutions Architect - Toronto, Canada
About the Role
We are seeking a Solutions Architect to help customers successfully deploy, operate, and scale AI/ML workloads on our GPU Platform-as-a-Service (PaaS) offering. In this customer-facing role, you will work closely with platform engineering, MLOps, data science, and infrastructure teams to design and implement production-ready AI infrastructure solutions built on Kubernetes and GPU-accelerated environments.
You will help customers onboard to the platform, optimize workload performance, automate infrastructure, and ensure reliable operations while serving as a trusted technical advisor throughout the customer lifecycle.
Responsibilities
- Partner with customer platform, MLOps, and data science teams to understand AI/ML workload requirements and translate them into scalable platform architectures.
- Design and deploy Kubernetes-based solutions for model training, fine-tuning, and inference workloads.
- Assist customers with onboarding and implementation of the GPU PaaS platform across cloud and hybrid environments.
- Configure networking, identity management, observability, and security integrations with enterprise systems.
- Build and maintain automation assets including Terraform modules, Helm charts, GitOps workflows, and CI/CD pipelines.
- Monitor and troubleshoot production environments, including GPU utilization, workload performance, cluster health, and cost efficiency.
- Support root cause analysis and remediation efforts for customer issues.
- Serve as a technical advisor and day-to-day point of contact for assigned customers.
- Document best practices and provide feedback to Product and Engineering teams to improve platform capabilities.
- Collaborate with internal teams to ensure successful customer adoption and expansion.
Required Qualifications
- 4+ years of experience in Solutions Architecture, DevOps, Platform Engineering, Site Reliability Engineering (SRE), Cloud Engineering, or related fields.
- Strong hands-on experience with Kubernetes in production environments.
- Experience with at least one programming language such as Python or Go.
- Experience with AWS, Azure, or GCP, including networking, IAM, and managed Kubernetes services.
- Knowledge of Infrastructure as Code and automation tools and practices such as Terraform, Helm, GitOps, and CI/CD platforms.
- Familiarity with monitoring and observability technologies such as Prometheus, Grafana, or OpenTelemetry.
- Understanding of AI/ML infrastructure concepts including GPU-based workloads, model serving, training pipelines, and resource optimization.
- Strong troubleshooting, communication, and customer-facing skills.
Preferred Qualifications
- Experience supporting enterprise customers in cloud-native environments.
- Familiarity with AI/ML frameworks such as PyTorch and TensorFlow.
- Experience with GPU scheduling, autoscaling, and workload optimization.
- Understanding of multi-tenant Kubernetes environments and platform operations.
- Experience working with MLOps or AI infrastructure platforms.
Why Join Rafay?
Rafay is at the forefront of GPU PaaS technologies and Kubernetes, and we offer unique opportunities to join a winning team working on foundational technology for cloud and AI/ML services and enterprises. We work in a collaborative environment that rewards creative thinking and provides opportunities to grow professionally in advanced technology development. On top of this, we offer a fun and dynamic work environment, a competitive salary, robust benefits, and attractive stock options. As the first of our kind, we are truly in a class of our own.








