The Kubernetes Current Blog

Introducing Automated Fleet Operations, Rafay’s Solution To Simplified Management Of Large Kubernetes Environments

The increased adoption of containers coupled with simplified Kubernetes cluster deployment in public clouds has led to a meteoric rise in the number of Kubernetes clusters across enterprise organizations. Gartner predicts that over 25% of all enterprise applications will run in containers by 2027, and this expansion of container use is expected to drive further investment in Kubernetes clusters. This rapid growth has led to a new set of management challenges and complexities, including the following:

  1. Clusters existing in multiple cloud subscriptions or accounts spread across the organization
  2. Clusters distributed in various environments like dev, test, staging and production
  3. Clusters in different regions
  4. Checks and other processes that must be run manually before or after key cluster operations
  5. Monitoring cluster health and status
  6. Proliferation of clusters with differing configurations
  7. Clusters deployed across various managed Kubernetes services such as EKS, AKS and GKE
  8. Limited tooling to manage cluster fleets

It can be a daunting task for Platform and Operations admins to manage this many clusters efficiently, or to keep them consistent from a health, security, and Kubernetes governance perspective. As a consequence, productivity and reliability can suffer. Properly implemented fleet-based management can help reduce these complexities and manage Kubernetes clusters in an efficient, consistent, and predictable manner.

Automated Fleet Operations: How it Works

Rafay’s Automated Fleet Operations is a new capability that addresses, at scale, the challenges of managing a fleet of Kubernetes clusters spanning various environments, cloud providers, data centers and accounts. At launch, Rafay’s Automated Fleet Operations will support Amazon EKS and Azure AKS, and will progressively add support for other distributions like GKE and upstream Kubernetes every few weeks.

To get started with Rafay Automated Fleet Operations, administrators specify a fleet plan. This is a master specification that defines the following components:

  1. Target clusters (identified using cluster names or cluster labels)
  2. Desired operation (e.g. upgrade, scale, etc.)
  3. Pre-hooks (e.g. preflight checks that should be run before executing the plan: checking for deprecated K8s APIs before an upgrade, approvals, etc.)
  4. Post-hooks (e.g. postflight checks that should be run after executing the plan: validation of the upgrade, cluster state, etc.)
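
To make the four components above concrete, here is a minimal Python sketch of how a fleet plan and its target selection could be modeled. It is an illustration only: the class names, field names, hook names and the `select_targets` helper are assumptions made for this example and do not reflect Rafay's actual fleet plan schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    labels: dict[str, str] = field(default_factory=dict)


@dataclass
class FleetPlan:
    # Hypothetical fields mirroring the four components listed above;
    # this is not Rafay's actual fleet plan schema.
    operation: str
    target_names: set[str] = field(default_factory=set)
    target_labels: dict[str, str] = field(default_factory=dict)
    pre_hooks: list[str] = field(default_factory=list)
    post_hooks: list[str] = field(default_factory=list)


def select_targets(plan: FleetPlan, clusters: list[Cluster]) -> list[Cluster]:
    """Return clusters matched by name, or carrying every target label."""
    selected = []
    for cluster in clusters:
        by_name = cluster.name in plan.target_names
        # An empty label selector matches nothing rather than everything.
        by_label = bool(plan.target_labels) and all(
            cluster.labels.get(key) == value
            for key, value in plan.target_labels.items()
        )
        if by_name or by_label:
            selected.append(cluster)
    return selected


# Made-up fleet and plan, for illustration only.
fleet = [
    Cluster("payments-prod", {"env": "prod", "provider": "eks"}),
    Cluster("payments-dev", {"env": "dev", "provider": "eks"}),
    Cluster("search-prod", {"env": "prod", "provider": "aks"}),
]

plan = FleetPlan(
    operation="upgrade",
    target_labels={"env": "prod"},
    pre_hooks=["check-deprecated-apis", "approval"],
    post_hooks=["validate-upgrade"],
)

print([c.name for c in select_targets(plan, fleet)])
# prints ['payments-prod', 'search-prod']
```

Selecting by label rather than by name means a plan like this would pick up newly added production clusters without being edited, which is one reason label-based targeting tends to suit large fleets.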

Each fleet plan consists of a “fleet controller” that manages the state machine, and fleet pipelines that manage the execution of fleet workflows. Below is a high-level view into the various elements and flow of a sample fleet plan, which could support any fleet operation such as a cluster upgrade:

Once created, fleet plans can be executed on a predefined schedule or by a manual trigger. Access is managed using role-based access control (RBAC), which ensures that only authorized users are able to define, view and execute fleet plans. Customers can also optionally define an approval process as part of the pre-hooks to protect against unauthorized execution. Administrators can define on-failure and on-success conditions, including retries, and track the status of each execution in an intuitive dashboard (as shown below).

Rafay Automated Fleet Operations is available to users of the platform via a user-friendly web console. Users can also automate everything using the Rafay CLI, API, and Terraform provider.

Rafay Fleet Plan status dashboard
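
The execution flow described in this section (pre-hooks, then the operation, then post-hooks, with retries and on-failure handling) can be pictured as a small state machine. The Python sketch below is a toy model under stated assumptions: the state names, the `execute_plan` and `run_with_retries` functions, and the convention that a step returns `True` on success are all invented for illustration and say nothing about how Rafay's fleet controller is actually implemented.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    PENDING = "pending"
    PRE_HOOKS = "pre-hooks"
    OPERATION = "operation"
    POST_HOOKS = "post-hooks"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# A step is any callable returning True on success (hypothetical convention).
Step = Callable[[], bool]


def run_with_retries(step: Step, max_retries: int) -> bool:
    """Run a step, retrying up to max_retries extra times on failure."""
    for _ in range(max_retries + 1):
        if step():
            return True
    return False


def execute_plan(pre_hooks: list[Step], operation: Step,
                 post_hooks: list[Step], max_retries: int = 1) -> list[State]:
    """Walk one cluster through the plan and return the visited states."""
    history = [State.PENDING]
    phases = [
        (State.PRE_HOOKS, pre_hooks),
        (State.OPERATION, [operation]),
        (State.POST_HOOKS, post_hooks),
    ]
    for state, steps in phases:
        history.append(state)
        for step in steps:
            if not run_with_retries(step, max_retries):
                history.append(State.FAILED)  # on-failure: stop here
                return history
    history.append(State.SUCCEEDED)  # on-success: every phase passed
    return history


attempts = {"count": 0}


def flaky_upgrade() -> bool:
    """Simulated operation that fails once, then succeeds on retry."""
    attempts["count"] += 1
    return attempts["count"] >= 2


ok = execute_plan([lambda: True], flaky_upgrade, [lambda: True])
print([s.value for s in ok])
# prints ['pending', 'pre-hooks', 'operation', 'post-hooks', 'succeeded']

blocked = execute_plan([lambda: False], flaky_upgrade, [])
print([s.value for s in blocked])
# prints ['pending', 'pre-hooks', 'failed']
```

In the second run the failing pre-hook stops the plan before the operation is ever attempted, which is the behavior an approval or deprecated-API check placed in the pre-hooks is meant to provide.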

Benefits of Automated Fleet Operations

By leveraging Rafay’s Automated Fleet Operations, users can not only address the complexities and challenges described in the introduction but also gain valuable benefits, including:

  1. Enhanced visibility through centralized management of clusters
  2. Improved productivity and operational efficiency
  3. Better compliance with organizational governance and security policies
  4. Increased consistency of cluster operations applied across the fleet
  5. Higher availability of key resources for developers

Automated Fleet Operations will be available to customers later this year. If you would like to learn more about the capabilities of this new enhancement to the Rafay platform as you consider its use in your production and dev/test environments, please reach out to us!

