
Optimizing Amazon EKS: Advanced Configuration, Scaling, and Cost Management Strategies

Amazon’s Elastic Kubernetes Service (EKS) makes it easy to provision and operate cloud-hosted Kubernetes clusters using AWS. It’s a managed service that automates the process of creating a control plane and connecting AWS EC2 instances that act as cluster nodes.

Using EKS can drastically reduce a Kubernetes admin's workload, but it's still vital to correctly optimize your clusters for performance, scalability, and cost efficiency. EKS users aren't immune to common Kubernetes problems, including resource over-provisioning, complicated app deployment processes, and unpredictable costs. In this article, we'll discuss three advanced strategies that help address these problems to deliver a dependable EKS experience.

Understanding EKS Optimization Challenges

EKS is a managed Kubernetes platform that automates key cluster administration and maintenance tasks. Compared with self-managing Kubernetes on standard cloud hosts, EKS simplifies cluster provisioning, provides integrated scaling, security, and cost management options, and can support seamless high availability by distributing control plane components across multiple Availability Zones within a region.

The problem is you can still easily end up with sub-optimal infrastructure or excessive spending. Common pain points experienced by EKS cluster operators include:

  • Over-provisioned workloads — An over-provisioned deployment has access to more physical resources (such as CPU or memory) than it actually requires. This wastes cluster capacity and can mean you end up paying for more nodes than you need (see the sketch after this list).
  • Under-utilized clusters — Similarly, under-utilization can also occur at the cluster level: it’s possible for your workloads to be right-sized, but your cluster’s node fleet to be excessively large or powerful. In this case, reducing your node count or switching to a more appropriate instance type can improve efficiency, without affecting performance.
  • Unexpected cost spikes — EKS clusters can be expensive to operate at scale and it’s sometimes hard to detect what’s affecting your bill. Unexpected costs can accrue for a variety of reasons, including incorrect auto-scaling settings and ineffective use of AWS savings options.
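
To make the over-provisioning point concrete, here's a minimal sketch of right-sized resource requests and limits on a Pod. The names, image, and values are illustrative assumptions only; appropriate figures come from measuring your actual workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web                        # hypothetical workload
spec:
  containers:
    - name: app
      image: nginx:1.25            # illustrative image
      resources:
        requests:                  # capacity the scheduler reserves on a node
          cpu: "250m"
          memory: "256Mi"
        limits:                    # hard ceiling enforced at runtime
          cpu: "500m"
          memory: "512Mi"
```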

Devising a pragmatic EKS optimization strategy is an essential step before you start operating your clusters. With a little planning, it’s possible to mitigate these challenges and ensure long-term reliability with predictable costs.

Advanced Strategies for EKS Optimization

The following three techniques are dependable ways to optimize the performance, efficiency, and management of EKS clusters. Implementing these methods will help improve your architecture, especially when combined with good adherence to the Kubernetes best practices that affect all cluster deployments.

1. Configure Cluster Autoscaling

Cluster autoscaling is a mechanism that automatically adjusts your EKS cluster’s node count in response to changing utilization. It’s used in combination with app-level autoscaling to make your infrastructure more resilient in times of high demand.

App autoscaling starts new Pod replicas as user demand grows. However, it’s possible the extra replicas could be unschedulable if your cluster’s capacity has already been reached. Cluster autoscaling handles this by dynamically provisioning additional EC2 node instances, expanding physical capacity to accommodate increased utilization. When demand subsides, the number of provisioned nodes automatically scales back down. This prevents cluster under-utilization and reduces your operating costs.
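
To illustrate the app-level half of this pairing, here's a minimal HorizontalPodAutoscaler sketch that scales a Deployment on CPU utilization; the Cluster Autoscaler then adds nodes whenever the extra replicas can't be scheduled. The target name and thresholds are assumptions for illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                      # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 20                  # ceiling that may exceed current node capacity
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas above 70% average CPU
```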

EKS autoscaling is provided by the Kubernetes Cluster Autoscaler, a component that’s maintained as part of the Kubernetes project. To enable it, you need to deploy the Autoscaler into your cluster and set up an AWS IAM policy that provides the required access to your account. The autoscaler will then continually monitor your cluster’s load and create or remove EC2 instances as required for your current workloads.

Particularly in larger clusters, you might need to tune the autoscaler itself to ensure an optimal balance between cost and availability. Best practices to follow include:

  • Minimize the number of node groups and instance types you create, providing scope for more Pods to be packed together on a node.
  • Configure an appropriate utilization scan interval — more frequent scans make your cluster more responsive to load changes, but can result in latency, rate limiting, and even unavailability due to the volume of Kubernetes API calls made. (The Deployment fragment after this list shows where this is set.)
  • Enable overprovisioning if your workloads regularly experience load spikes. This will keep some spare nodes active, minimizing delays caused by starting new EC2 instances during scale-ups.
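
As a sketch of where these settings live, here's a fragment of the Cluster Autoscaler's Deployment spec with the relevant flags. The image tag and cluster name are assumptions; match the autoscaler version to your cluster's Kubernetes version:

```yaml
# Fragment of the Cluster Autoscaler Deployment (container spec only)
spec:
  containers:
    - name: cluster-autoscaler
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0  # assumed tag
      command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        # Discover node groups via ASG tags rather than listing them by hand
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        # Treat similar node groups as one so Pods pack more densely
        - --balance-similar-node-groups
        # The utilization scan interval discussed above
        - --scan-interval=30s
```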

You can read more recommendations in the Cluster Autoscaler documentation.

2. Implement Spot Instances to Achieve Cost Savings

Using EC2 Spot Instances for your nodes is an easy way to score big savings on your EKS operating costs. Whereas regular On-Demand Instances have a fixed price and are continually available, Spot Instance pricing is variable and instances may be stopped with a two-minute notice period.

Spot Instances represent unused EC2 capacity that Amazon is prepared to sell at a heavy discount — up to 90% off the equivalent On-Demand Instance price. But because EC2 usage is continually changing, Spot Instances sometimes have to be reclaimed as other customers launch new On-Demand or Reserved Instances. When this happens, an interruption notice is sent and the instance is then stopped.

The ephemeral nature of Spot Instances means they aren’t suitable for every use case. However, they’re often a good fit for EKS, where cluster nodes can be readily replaced without introducing faults into your infrastructure. Selecting Spot Instances instead of On-Demand Instances could drastically cut your bill without substantially affecting availability. These savings aren’t just theoretical, either — UK TV service ITV achieved a $150,000 cost saving by using Spot Instances with EKS, while cloud migration company SourceFuse observed a 75% reduction.

Effectively utilizing Spot Instances does require care, but this is easily managed when you're already using cluster autoscaling to optimize capacity. It's recommended that you diversify your Spot Instance pools across multiple Availability Zones and instance types, as this minimizes the risk of too many of your instances being interrupted at the same time. You can also use Rafay to simplify your EC2 Spot Instance configuration.
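
If you provision clusters with eksctl, a diversified Spot node group can be declared in the cluster config. This is a minimal sketch; the cluster name, region, zones, and instance types are placeholder assumptions:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster              # assumed cluster name
  region: us-east-1             # assumed region
managedNodeGroups:
  - name: spot-nodes
    spot: true                  # request Spot capacity instead of On-Demand
    instanceTypes: ["m5.large", "m5a.large", "m4.large"]  # diversified pool reduces simultaneous interruptions
    availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
    minSize: 2
    maxSize: 10
```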

3. Leverage Kubernetes Operators for Automated EKS Management

Many apps are relatively complex to deploy in Kubernetes, requiring multiple manifest files to correctly configure storage, networking, and high availability. The operator pattern is a solution that makes it easier to run these workloads by facilitating app-specific automation.

Operators are available for many popular apps used by DevOps teams. The MySQL Operator makes it simple to deploy read-replicated MySQL databases to your cluster, for example, while the Prometheus Operator simplifies the installation of Prometheus with related in-cluster monitoring components.

To utilize an operator, you must first install it in your cluster — this typically means using kubectl or Helm to deploy the operator's manifests. Once installed, the operator registers Kubernetes Custom Resource Definitions (CRDs) that let you deploy app instances by creating objects in your cluster. For example, you can create a database via the MySQL Operator by using the InnoDBCluster CRD.
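
As a sketch of that workflow, this is roughly what an InnoDBCluster object looks like once the MySQL Operator is installed. The names and replica counts are assumptions, and the exact fields depend on the operator version you deploy:

```yaml
apiVersion: mysql.oracle.com/v2   # API group registered by the MySQL Operator
kind: InnoDBCluster
metadata:
  name: my-db                     # hypothetical database name
spec:
  secretName: my-db-root          # pre-created Secret holding root credentials
  instances: 3                    # read-replicated MySQL server Pods
  tlsUseSelfSigned: true
  router:
    instances: 1                  # MySQL Router Pods that front client connections
```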

Most operators work with any Kubernetes cluster, not just EKS, but some are specifically designed to reduce the burden on AWS and EKS administrators. Rancher's EKS Operator lets you automate cluster provisioning, while the AWS Service Operator can be used to manage other resources in your AWS account. These operators can help you establish your Kubernetes cluster as the single source of truth for your AWS infrastructure.

Continually Monitoring and Optimizing EKS Costs

However you manage EKS, it’s important to keep a handle on costs. Regularly monitoring bills and analyzing how costs have been accrued helps identify inefficiencies so you can improve your clusters and start making savings.

The strategies discussed above, such as switching to Spot Instances and configuring autoscaling to prevent under-utilization, are an effective way to begin cost optimization. You can monitor costs associated with your cluster in the AWS Billing Console or by using dedicated cost optimization tools such as Kubecost or Rafay.

Crucially, our solution gives you a consolidated view of your spending across all your clusters, including those in EKS, other clouds, and on-premises datacenters. Precise attribution of costs to different teams and apps in shared clusters allows you to easily identify who’s driving the most resource consumption, without requiring any additional infrastructure.

Optimize Your EKS Operations With Autoscaling, Spot Instances, and Kubernetes Operators

Amazon EKS brings welcome simplicity to Kubernetes operations in the cloud. Nonetheless, we've seen that there are still steps to take to ensure your clusters remain performant, scalable, and cost efficient.

In this article, we’ve shared three key strategies for maximizing your EKS ROI. Cluster autoscaling optimizes utilization, while enabling AWS Spot Instances can realize substantial per-node cost savings. These techniques enhance your cluster-level management, while adopting applicable Kubernetes operators can simplify your app deployment workflows.

Applying these best practices will help you avoid potentially costly pitfalls so you can capitalize on the full benefits of EKS. To continue your journey, read our guide to getting started with EKS or check out how to further streamline your EKS operations with Rafay’s Kubernetes Operations Platform — including cohesive multi-cluster management, GitOps deployments, and policy-based security governance.
