GPU/AI/ML FAQs

What exactly does Rafay do or provide around AI/ML or cloud-native adoption?

Rafay provides a Platform-as-a-Service (PaaS) solution that enables companies to create customized compute environments for developers and data scientists. It helps platform engineering teams deliver a user-friendly PaaS experience quickly, typically in weeks instead of years. Rafay’s platform enables faster development and deployment of new capabilities while maintaining necessary controls and guardrails. By simplifying the process of implementing complex platforms, Rafay reduces the need for large teams of experts. In essence, Rafay streamlines cloud-native and AI/ML adoption by offering a ready-to-use platform that balances speed, efficiency, and security for businesses.

Does Rafay offer a GPU PaaS?

Yes, Rafay provides a Platform-as-a-Service (PaaS) solution that supports both CPU-only and GPU-accelerated compute environments. Platform teams can quickly set up and deliver customized self-service experiences for developers and data scientists, typically within days or weeks. This flexible platform allows end-users to easily access the computational resources they need, whether it’s standard CPU processing or more powerful GPU capabilities. Rafay’s solution streamlines the deployment and management of diverse computing environments, making it easier for organizations to support a wide range of applications, from standard software to complex AI/ML projects.

What does Rafay offer for ML workbenches?

Rafay provides curated ML workbenches that offer developers and data scientists an experience similar to Amazon SageMaker or Google Vertex AI, but at a more competitive price point. The platform includes out-of-the-box services such as Notebooks-as-a-Service, with pre-built environments featuring TensorFlow, PyTorch, and other popular libraries for immediate productivity. For those preferring a job-based model, Rafay offers Ray-as-a-Service, allowing data scientists to focus on their work without dealing with infrastructure complexities. Advanced teams can opt for a Kubeflow-based ML workbench, which manages pipelines, experiment tracking, and model repositories. These solutions enable data science teams to work efficiently with their preferred tools while Rafay handles the underlying infrastructure management.
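
As an illustration of the job-based model, here is a minimal sketch of what a data scientist might run against a Rafay-provisioned Ray cluster. The Ray calls (`ray.init`, `@ray.remote`, `ray.get`) are the standard Ray API; the cluster address, GPU count, and the placeholder training function are assumptions for illustration only.

```python
import ray

# Connect to an existing Ray cluster; "auto" attaches to the cluster
# this script was submitted to. A provisioned endpoint address works too.
ray.init(address="auto")

# A task that reserves one GPU from the cluster's pool for its duration.
@ray.remote(num_gpus=1)
def fine_tune_shard(shard_id: int) -> float:
    # Placeholder step: a real job would load data and a model here.
    import torch  # pre-installed in the curated environment
    return float(torch.rand(1).item())

# Fan out four GPU tasks in parallel and gather their results.
losses = ray.get([fine_tune_shard.remote(i) for i in range(4)])
print(losses)
```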

What does Rafay offer for GenAI playgrounds?

Rafay provides a controlled, cost-effective Generative AI playground for organizations new to GenAI. This environment allows data scientists to train, tune, and serve GenAI models, enabling efficient experimentation and development without significant investment or infrastructure complexity. It’s ideal for businesses looking to explore GenAI capabilities while managing costs and maintaining control over their AI initiatives.

Who uses Rafay's platform for AI/ML initiatives?

Rafay’s AI/ML platform is used by a range of organizations, particularly in the financial services sector. Rafay is also collaborating with major GPU vendors on specialized use cases. A notable public example of a company using our AI/GPU stack is MoneyGram, a global leader in cross-border P2P payments and money transfers.

How does Rafay’s platform accelerate time-to-value for AI/ML projects?

  • Without Rafay, platform teams spend years building complex platforms in-house with large teams of experts.
  • With Rafay, platform teams can deliver a finely tuned PaaS experience to internal users in weeks.

How does Rafay ensure compliance and governance for enterprise AI initiatives?

Rafay applies its proven governance and control features, originally developed for cloud-native projects, to AI/GPU initiatives. These capabilities include blueprinting, access management, chargebacks, and auditing/logging. This approach ensures that enterprises can maintain compliance and control over their AI projects, just as they do with other cloud-native initiatives. By leveraging these established features, Rafay helps organizations accelerate AI adoption while maintaining the necessary governance standards, ultimately leading to increased revenues and lower total cost of ownership for both cloud-native and AI/ML projects.

How does Rafay's platform streamline AI/ML infrastructure management for enterprise adoption?

Rafay enables enterprise platform teams to deliver a PaaS experience for GPU resources, both on-premises and in the cloud. The platform offers a cost-effective alternative to services like Amazon SageMaker or Google Vertex AI, providing ML workbenches with similar functionality. Rafay’s self-service model and hierarchical experience sharing allow platform teams to selectively offer compute and ML workbench experiences to different teams, optimizing access to expensive GPU resources. Additionally, the platform includes chargeback capabilities to ensure fair cost allocation among internal teams. This comprehensive approach simplifies AI/ML infrastructure management, accelerating enterprise adoption while maintaining cost control and resource efficiency.

Does Rafay provide AI/ML workbenches and other tooling?

Yes, Rafay offers a comprehensive suite of AI/ML tools. The platform provides out-of-the-box workbenches based on Kubeflow and KubeRay, delivered as fully managed services. This allows users to access sophisticated AI/ML platforms without dealing with infrastructure complexities. Additionally, Rafay includes a low-code/no-code framework that enables partners to rapidly develop and deploy specialized AI solutions such as verticalized agents, co-pilots, and document translation services. This combination of ready-to-use workbenches and a flexible development framework streamlines the adoption and customization of AI/ML tools for various enterprise needs, accelerating time-to-market for new AI capabilities.
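
To make the Kubeflow-based workbench concrete, the sketch below shows a minimal Kubeflow Pipelines definition using the KFP v2 SDK, of the kind a data scientist would author on such a workbench. The component logic and names are illustrative assumptions; the `kfp` decorators and compiler call are the standard SDK.

```python
from kfp import compiler, dsl

# A lightweight component; it runs in its own container on the cluster.
@dsl.component(base_image="python:3.11")
def train(learning_rate: float) -> float:
    # Placeholder "training" that derives a score from the input.
    return 1.0 - learning_rate

# A pipeline wiring the component(s) together.
@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    train(learning_rate=learning_rate)

if __name__ == "__main__":
    # Produces a package that can be uploaded to the Kubeflow Pipelines
    # UI or submitted through its API for execution and tracking.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```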

Is GPU Virtualization supported?

Yes, Rafay supports GPU virtualization. The platform enables GPU and Sovereign Cloud providers to offer fractional GPU resources to end users through a self-service interface; a sketch of what such a fractional request can look like follows the list below. Rafay’s system manages key aspects of virtualization, including:

  1. Security measures
  2. Compute isolation
  3. Chargeback data collection
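
To make the fractional model concrete, below is a minimal sketch of a pod requesting a fractional GPU slice via the standard Kubernetes Python client. The exact resource name depends on the virtualization mechanism in use (the `nvidia.com/mig-1g.5gb` MIG profile shown here is one common example exposed by the NVIDIA device plugin); the namespace, image, and labels are placeholders, not Rafay-specific values.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig.
config.load_kube_config()

# A pod that requests one fractional GPU slice. "nvidia.com/mig-1g.5gb"
# is a typical NVIDIA MIG profile name; time-slicing setups instead
# expose plain "nvidia.com/gpu" shares. Namespace/image are placeholders.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="frac-gpu-trainer",
        labels={"team": "team-a"},  # label usable later for chargeback grouping
    ),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/ml/trainer:latest",
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/mig-1g.5gb": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
```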

How does Rafay handle chargebacks and billing?

Rafay offers a comprehensive solution for chargebacks and billing. The platform collects granular chargeback information on resource usage, which can be easily exported to customers’ existing billing systems for further processing and distribution. Rafay allows for customizable chargeback group definitions to align with organizational structures or projects. Both group definition and data collection can be carried out programmatically, enabling efficient and accurate billing processes.
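
As a purely hypothetical sketch of the programmatic flow described above, the snippet below pulls usage for one chargeback group and flattens it into a CSV for a downstream billing system. The endpoint path, field names, and token handling are illustrative assumptions, not Rafay's documented API.

```python
import csv

import requests

# Hypothetical endpoint and schema -- illustrative only, not Rafay's
# documented API. Substitute the actual export interface in use.
API = "https://rafay.example.internal/api/v1/chargeback"
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credential

resp = requests.get(
    f"{API}/groups/data-science/usage",
    headers=HEADERS,
    params={"from": "2025-01-01", "to": "2025-01-31"},
    timeout=30,
)
resp.raise_for_status()

# Flatten usage records into a CSV a billing system can ingest.
with open("chargeback_jan.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["group", "resource", "hours", "cost"])
    writer.writeheader()
    for row in resp.json().get("items", []):
        writer.writerow(row)
```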

How is Rafay different from Run:AI?

Run:AI focuses on providing fractional/virtualized GPU consumption and a proprietary scheduler optimized for AI/GenAI workloads, replacing the default Kubernetes scheduler. Rafay, however, provides a more comprehensive platform that manages the full lifecycle of underlying Kubernetes clusters and environments. Rafay offers an out-of-the-box experience to deploy and consume Run:AI on Rafay’s GPU PaaS, while also providing its own GPU virtualization and AI-friendly Kubernetes scheduler for customers preferring a single-vendor solution. Essentially, Rafay can either complement Run:AI’s offerings or provide a standalone solution that covers similar functionalities along with broader infrastructure management capabilities, giving customers flexibility in their AI infrastructure choices.

Does Rafay support NVIDIA NIMs/NIM?

Yes, Rafay supports NVIDIA NIM (NVIDIA Inference Microservices). NIM is NVIDIA’s proprietary solution for delivering packaged inferencing capabilities. It comes pre-configured with NVIDIA’s in-house models and has been optimized for a wide range of open-source models, including Meta’s Llama variants. While NIM is often viewed as an alternative to the open-source KServe package, Rafay’s platform supports both NIM and KServe. This flexibility allows customers to choose their preferred inference endpoint and deploy it effortlessly on GPU instances using the Rafay platform. By supporting multiple inferencing solutions, Rafay enables organizations to leverage the most suitable tools for their specific AI/ML needs while maintaining a consistent and manageable infrastructure.
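
For instance, standing up a KServe endpoint on a GPU instance comes down to creating an InferenceService object on the target cluster. The sketch below uses the standard Kubernetes Python client against KServe's v1beta1 CRD; the model format, storage URI, namespace, and GPU count are placeholders that depend on the serving runtimes installed.

```python
from kubernetes import client, config

config.load_kube_config()

# A minimal KServe InferenceService backed by one GPU. Model format,
# storage URI, and namespace are placeholders for illustration.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-demo", "namespace": "ml-serving"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "huggingface"},
                "storageUri": "hf://meta-llama/Llama-3.1-8B-Instruct",
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="ml-serving",
    plural="inferenceservices",
    body=inference_service,
)
```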

Why consider Rafay's solution over AWS SageMaker or Google Vertex AI?

While AWS SageMaker and Google Vertex AI offer fully managed services, Rafay’s Kubernetes and Kubeflow-based MLOps solution provides distinct advantages. It offers vendor agnosticism, allowing deployment across various cloud providers or on-premises, thus avoiding vendor lock-in. Rafay’s approach enables greater customizability, giving users more control over their infrastructure and workloads. It can also be more cost-efficient, as managing your own Kubernetes clusters allows for optimized resource utilization. This combination of flexibility, control, and potential cost savings makes Rafay’s solution appealing for organizations seeking a tailored and adaptable MLOps environment that can evolve with their specific needs and infrastructure preferences.

How does Rafay's solution fit into existing AWS/Google Cloud workflows?

Rafay’s MLOps platform is designed to seamlessly integrate with existing cloud ecosystems, including AWS and Google Cloud. The solution supports integration with various cloud services, allowing organizations to leverage their current investments and workflows. Rafay’s platform excels in hybrid and multi-cloud environments, providing a unified interface to manage MLOps workflows consistently across different infrastructures. This approach enables businesses to maintain their existing cloud relationships while gaining the added benefits of Rafay’s flexible, vendor-agnostic platform. By bridging the gap between different cloud environments, Rafay allows organizations to optimize their MLOps processes without disrupting established workflows, offering a smooth transition and enhanced capabilities for AI/ML initiatives.

Will managing Kubernetes and Kubeflow add complexity compared to fully managed services?

While Kubernetes and Kubeflow management can be complex, Rafay’s platform is specifically designed to simplify these processes. The solution addresses potential complexity in three key ways:

  1. User-Friendly Interface: Rafay provides an intuitive UI and automation tools that significantly reduce the complexity typically associated with Kubernetes.

  2. Managed Kubernetes Service: The platform offers managed Kubernetes services that handle cluster provisioning, scaling, and maintenance, allowing teams to focus on developing models rather than managing infrastructure.

  3. Expert Support: Rafay provides comprehensive support and documentation to help teams navigate any challenges, effectively reducing the learning curve.

This approach enables organizations to harness the power and flexibility of Kubernetes and Kubeflow without the added complexity.

What about the cost? Are there hidden expenses in managing our own infrastructure?

Rafay aims to provide transparent and potentially cost-saving solutions for managing AI/ML infrastructure. The platform addresses cost concerns in three key areas:

  1. Transparent Pricing: Rafay offers clear pricing models without the hidden fees that can be associated with fully managed services.

  2. Cost Control: By managing your own infrastructure through Rafay, you can optimize resource usage and avoid over-provisioning, potentially leading to significant cost savings.

  3. Avoiding Vendor Premiums: Fully managed services often come with a premium for convenience. Rafay enables you to balance convenience and cost effectively.

This approach allows organizations to have greater control over their infrastructure costs while still benefiting from the ease of use provided by Rafay’s platform.

What's Rafay's stance on support and reliability compared to established providers?

Rafay is committed to providing enterprise-grade support and reliability, comparable to established providers like AWS and Google. The platform offers dedicated support teams to assist with any issues, ensuring minimal downtime and quick resolutions. Rafay’s technology stack is built on mature, widely adopted open-source technologies like Kubernetes and Kubeflow, which are trusted across the industry. This foundation provides a robust and reliable infrastructure for AI/ML workloads. Additionally, Rafay’s focus on MLOps allows for specialized support that may not be available with more generalized cloud providers. By combining proven technologies with dedicated, specialized support, Rafay aims to deliver a reliable and well-supported platform that meets the high standards expected in enterprise environments.

How do Rafay's GPU PaaS and MLOps offerings benefit an AWS sales team?

Rafay’s offerings complement AWS services in two key ways, benefiting both customers and AWS sales teams. For customers using SageMaker and Bedrock, Rafay enhances AWS’s ecosystem with additional cloud-native and Kubernetes management capabilities.

For customers hesitant to use SageMaker or Bedrock, Rafay provides a similar experience that can be fully deployed within AWS accounts, addressing concerns about cost or data exposure.

Crucially, Rafay’s solutions drive direct compute consumption on AWS, contributing to customers’ Enterprise Discount Program (EDP) commitments. This helps AWS sales teams meet their targets and potentially expand future EDPs, making Rafay a valuable partner in the AWS ecosystem that can increase overall AWS usage and revenue.