How to Automate Upgrades to Amazon EKS 1.24 Stargazer

EKS Version	Released	End of Support
1.28	September 2023	November 2024
1.27	May 2023	July 2024
1.26	April 2023	June 2024
1.25	February 2023	May 2024
1.24	November 2022	January 31, 2024
1.23	August 2022	October 11, 2023	← You are here

It’s that time again – IT operations teams around the world are once more sizing up an update of their Amazon Elastic Kubernetes Service (EKS) deployments to version 1.24, before version 1.23 reaches end of support (EoS). Maybe you work for one of them! If you do, you may have experienced two alternating feelings when EKS upgrade season arrives: joy (when a technical limitation your organization has been struggling to deal with has finally been fixed) and weariness (when a new EKS EoS date arrives alongside a long checklist of things to work through before you can run the upgrade successfully).

Whether you’re new to this process or a veteran, let’s take a few minutes to understand the process a little better, look more closely at what’s changed in the latest version of EKS, and discuss strategies that can help make these seasonal events, dare I say, painless.

Why are we doing this (again)?

The success of Kubernetes across cloud infrastructures everywhere, and Amazon’s EKS in particular, is due to a long list of factors too numerous to list here. But certainly, two of them are its stability and its rapid release schedule. These two factors may appear to be at odds, but the architects of open source Kubernetes have truly done a remarkable job at striking the right balance between remaining responsive to the needs of the community using K8s, while simultaneously protecting the stability of all the organizations that depend on it for basic survival.

With EKS, Amazon continues to maintain the largest distribution of Kubernetes in the public cloud, and a major driving force behind its broad adoption is its ongoing commitment to remain current with the pace of Kubernetes’ development. Amazon releases about 3 version updates to EKS a year, in alignment with the release calendar of upstream Kubernetes. Each version is supported by Amazon for about 14 months, which means that at any given time, there are about 3-5 supported versions of EKS.

Unfortunately, updating your deployed version of EKS usually isn’t as simple as updating the software on your smartphone. Each version update comes with a long list of things to check and potentially fix before you can run it (we’ll cover the specific changes in EKS 1.24 in a little more detail below). This is why IT operations teams often have to plan for weeks before executing an EKS upgrade across their cloud environments.

But EKS versions are supported for 14 months, which seems like a long time to go before having to worry about applying updates, right? Unfortunately, the picture isn’t quite that rosy. Amazon doesn’t currently support skipping versions when doing cluster upgrades – EKS users must update from their current minor version to the next adjacent minor version (1.23 to 1.24, then 1.24 to 1.25), and so on. Depending on the nature of the updates, each has the potential to be a time-consuming affair, and most IT operations teams only have the bandwidth to deal with one update at a time. So they are stuck in an endless cycle of jumping from lily pad to lily pad before the loss of support pulls them underwater.
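To make the no-skipping rule concrete, here is a minimal Python sketch that lists every hop a cluster has to take between two versions. The function name `upgrade_path` is purely illustrative and is not part of any AWS or Rafay tooling.

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """Return each minor version to step through, in order, from current to target."""
    cur_major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target version is older than the current version")
    # One hop per adjacent minor version; no version may be skipped.
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


print(upgrade_path("1.23", "1.26"))  # ['1.24', '1.25', '1.26']
```

A cluster that has fallen three versions behind therefore faces three separate upgrades, each with its own checklist.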

What’s changing in EKS 1.24?

As you know by now, every new EKS version has a unique batch of changes that must be parsed. Amazon EKS 1.24 (also called the Stargazer release) has a number of important changes that EKS customers will want to pay attention to:

Dockershim has been removed – This is the most impactful change by far. Dockershim, the Container Runtime Interface (CRI) adapter for Docker, has been pulled, which means that only containerd will be supported going forward. OCI images generated by docker build tools will continue to run in EKS, and Docker can still be used to build containers outside the cluster. However, any tooling that depends on Dockershim will need to be checked and updated.

Beta APIs are now inactive by default – To reduce potential exposure of customers to software bugs, new beta APIs are no longer enabled by default in EKS clusters. This doesn’t affect existing beta APIs and new versions of existing beta APIs, which will continue to be enabled.

Certificates controller behavior changes – Until 1.24, kubelet serving certificates that requested unverifiable IP and DNS Subject Alternative Names (SANs) were still issued, with the unverifiable SANs left out. Now, no kubelet serving certificate will be issued if any SAN cannot be verified. On affected nodes, this prevents the kubectl exec and kubectl logs commands from working.

EKS now offers availability zone hints – EKS customers often use multiple AWS availability zones for resiliency but sometimes find themselves inadvertently paying excess transfer charges for data moving between them. Amazon now offers Topology Aware Hints to help EKS users keep Kubernetes service traffic within the same availability zone to avoid those charges.

Pod security is evolving – Kubernetes is moving from Pod Security Policy (PSP) to Pod Security Admission (PSA), a built-in admission controller. A beta version of this controller is now available, and EKS users should move to it before support for PSP is removed in EKS 1.25. They can also use solutions like Kyverno or OPA Gatekeeper to serve the same function.

New autoscaling for EKS managed node groups – Amazon has contributed a feature to the upstream Cluster Autoscaler project that simplifies scaling Amazon EKS managed node groups (MNG) to and from zero nodes. When there are no running nodes in the MNG, the Cluster Autoscaler will call the EKS DescribeNodegroup API to get the information it needs about MNG resources, labels, and taints.

Kubelet credential provider is in beta – Amazon EKS Anywhere users may be pleased to learn that 1.24 includes kubelet support for image credential providers. You can now request credentials for a container registry dynamically, as opposed to storing static credentials on local storage.
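Two of the changes above, Pod Security Admission and Topology Aware Hints, come down to small metadata edits on existing objects. The Python sketch below shows the idea on manifests held as plain dictionaries. The helper names are hypothetical; the label and annotation keys are the upstream Kubernetes ones as of the 1.24 timeframe, so check the documentation for your target version before relying on them.

```python
import copy

PSA_MODES = ("enforce", "audit", "warn")
PSA_LEVELS = ("privileged", "baseline", "restricted")


def with_pod_security(namespace: dict, level: str, modes=PSA_MODES) -> dict:
    """Return a copy of a Namespace manifest labeled for Pod Security Admission."""
    if level not in PSA_LEVELS:
        raise ValueError(f"unknown Pod Security level: {level}")
    labeled = copy.deepcopy(namespace)
    labels = labeled.setdefault("metadata", {}).setdefault("labels", {})
    for mode in modes:
        labels[f"pod-security.kubernetes.io/{mode}"] = level
    return labeled


def with_topology_hints(service: dict) -> dict:
    """Return a copy of a Service manifest that opts in to Topology Aware Hints."""
    annotated = copy.deepcopy(service)
    annotations = annotated.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["service.kubernetes.io/topology-aware-hints"] = "auto"
    return annotated


# Hypothetical objects, used only to show the resulting metadata.
namespace = {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": "payments"}}
service = {"apiVersion": "v1", "kind": "Service", "metadata": {"name": "checkout"}}

print(with_pod_security(namespace, "baseline", modes=("warn",))["metadata"]["labels"])
print(with_topology_hints(service)["metadata"]["annotations"])
```

Starting with only the warn or audit mode is a common way to see which workloads would be rejected before turning on enforce.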

Automated Fleet Operations makes upgrades easier

Sounds complicated? Believe it or not, there is hope! Cloud automation can dramatically streamline your Kubernetes upgrade process, making all your cluster upgrades faster, smoother, and less risky.

Rafay’s cloud automation platform has built-in capabilities for Automated Fleet Operations, which eliminate the error-prone, manual processes of traditional Kubernetes cluster fleet operations by implementing repeatable, automated workflows that can be applied to multiple clusters. This allows platform and operations teams that are already stretched thin to increase productivity and improve the reliability of lifecycle management operations (like cluster upgrades) performed on Kubernetes clusters fleet-wide.

Batch automation of comprehensive lists of actions, including cluster upgrades and configuration patching, can be carried out on any number of clusters. Platform and operations teams can trigger actions multiple times and the actions will run until completion in a secure manner that meets all compliance requirements.

For cluster upgrades in particular, there are two paths you can take.

Use an automated upgrade plan customized for the release being updated

Automated Fleet Operations was designed specifically for these kinds of common scenarios, helping organizations that depend on Kubernetes address two foundational issues with upgrades:

Easily customizable upgrade plans – Organizations can create highly tuned upgrade plans that allow them to perform upgrades with zero downtime or impact to applications and cloud users.

Upgrading at scale – Upgrading an organization’s fleet of Kubernetes clusters one by one can be extremely painful and cumbersome. With Rafay’s fleet operations, organizations can automate upgrades across an entire cluster fleet, or across a portion of it selected by cluster labels.
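Targeting part of a fleet by label is a simple matching rule. The Python sketch below illustrates the idea only; the fleet data and the `select_clusters` helper are hypothetical and do not represent Rafay’s API.

```python
def select_clusters(fleet: dict[str, dict[str, str]], selector: dict[str, str]) -> list[str]:
    """Return the names of clusters whose labels match every key/value in the selector."""
    return [
        name
        for name, labels in fleet.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]


# Hypothetical fleet: cluster name -> labels.
fleet = {
    "payments-prod": {"env": "prod", "region": "us-east-1"},
    "payments-dev": {"env": "dev", "region": "us-east-1"},
    "search-dev": {"env": "dev", "region": "eu-west-1"},
}

print(select_clusters(fleet, {"env": "dev"}))  # ['payments-dev', 'search-dev']
```

Upgrading the dev clusters first and the prod clusters later is one way to use such a selector to stage a rollout.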

To make the upgrade to EKS 1.24 as painless as possible for our customers, Rafay has created a reference implementation of a validated upgrade plan designed to get users to 1.24 with zero impact to the resident applications. After a few tweaks to fit the plan to a customer’s specific cloud environment, Rafay can execute the process outlined below completely automatically.

With this fleet plan, Rafay automates the following actions:

Check for deprecated APIs – Ensures that the cluster is compatible with the new version by checking for any APIs that have been deprecated.

Check for Docker socket mounts – Checks if any applications in the cluster are using Docker socket mounts, since Dockershim has been removed in 1.24. Any applications using Docker socket mounts will need to be updated before upgrading to 1.24.

Cluster upgrade – Once the above prechecks have passed, the action for cluster upgrade will be triggered to start upgrading the fleet of clusters to 1.24.
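The two prechecks above can be approximated with a short scan over workload manifests. The Python sketch below is an illustration under stated assumptions, not Rafay’s implementation: the manifests are plain dictionaries, the helper names are hypothetical, and the API table is a small illustrative subset of versions that upstream Kubernetes removes in 1.25.

```python
DOCKER_SOCKET_PATHS = ("/var/run/docker.sock", "/run/docker.sock")

# Illustrative subset only: (apiVersion, kind) -> replacement apiVersion.
DEPRECATED_APIS = {
    ("batch/v1beta1", "CronJob"): "batch/v1",
    ("policy/v1beta1", "PodDisruptionBudget"): "policy/v1",
    ("autoscaling/v2beta1", "HorizontalPodAutoscaler"): "autoscaling/v2",
}


def pod_spec(manifest: dict) -> dict:
    """Locate the pod spec inside a Pod, a CronJob, or a templated workload."""
    spec = manifest.get("spec") or {}
    if manifest.get("kind") == "Pod":
        return spec
    if manifest.get("kind") == "CronJob":
        spec = (spec.get("jobTemplate") or {}).get("spec") or {}
    return (spec.get("template") or {}).get("spec") or {}


def docker_socket_volumes(manifest: dict) -> list[str]:
    """Return the names of hostPath volumes that point at the Docker socket."""
    return [
        volume.get("name", "<unnamed>")
        for volume in pod_spec(manifest).get("volumes") or []
        if (volume.get("hostPath") or {}).get("path") in DOCKER_SOCKET_PATHS
    ]


def precheck(manifests: list[dict]) -> list[str]:
    """Collect findings; an empty list means the upgrade may proceed."""
    findings = []
    for manifest in manifests:
        api, kind = manifest.get("apiVersion"), manifest.get("kind")
        ref = f"{kind}/{(manifest.get('metadata') or {}).get('name', '<unnamed>')}"
        replacement = DEPRECATED_APIS.get((api, kind))
        if replacement:
            findings.append(f"{ref}: {api} is deprecated, use {replacement}")
        for volume in docker_socket_volumes(manifest):
            findings.append(f"{ref}: volume '{volume}' mounts the Docker socket")
    return findings


# Hypothetical workload that still mounts the Docker socket.
agent = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "log-agent"},
    "spec": {"template": {"spec": {"volumes": [
        {"name": "dockersock", "hostPath": {"path": "/var/run/docker.sock"}},
    ]}}},
}

for finding in precheck([agent]):
    print(finding)
```

Gating the upgrade on an empty findings list mirrors the order described above: the cluster upgrade action only starts once both prechecks pass.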

Rafay plans to continue to release customized fleet upgrade plans for its customers for future versions of Amazon EKS, as well as other managed K8s services like Microsoft AKS.

Use a blue/green upgrade strategy

The automations above go a long way towards dramatically reducing the work involved for EKS updates – but they only buy you a few months before the next End of Support date rolls around. If you are looking to give your organization a bit more runway, there’s another option.

As mentioned earlier, Amazon doesn’t allow you to skip versions when upgrading a cluster in place. But there’s nothing stopping you from creating a NEW cluster at the target version and migrating your workloads over to it. This is called the blue/green approach, and automation platforms like Rafay can greatly simplify its implementation.

This approach is well suited for scenarios where an extremely low blast radius is required. In the short term, it’s a bit costlier because this upgrade strategy will essentially duplicate infrastructure costs. But in exchange, users can run their applications on two clusters at different versions, and retain the option to switch back and forth between the old and new clusters as required. They can use networking techniques, like DNS updates, to shift traffic from the old (blue) cluster to the new (green) cluster. This simple DNS change allows users to easily test the newer version of Kubernetes on the green cluster and roll back to the blue cluster if needed.
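As one way to picture the DNS switch, the Python sketch below builds the change request for a cutover. It assumes the application’s hostname is a CNAME record in Amazon Route 53, which the approach above does not require; the hostnames and the `cutover_change_batch` helper are hypothetical. The dictionary follows the shape of the ChangeBatch parameter of Route 53’s ChangeResourceRecordSets API.

```python
def cutover_change_batch(record_name: str, target_endpoint: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that points record_name at target_endpoint."""
    return {
        "Comment": f"Point {record_name} at {target_endpoint}",
        "Changes": [
            {
                # UPSERT creates the record if missing and overwrites it otherwise.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    # A short TTL keeps both cutover and rollback quick to take effect.
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_endpoint}],
                },
            }
        ],
    }


# Hypothetical load balancer hostnames for the two clusters.
BLUE = "blue-ingress.example.com"
GREEN = "green-ingress.example.com"

cutover = cutover_change_batch("app.example.com", GREEN)
rollback = cutover_change_batch("app.example.com", BLUE)
print(cutover["Changes"][0]["ResourceRecordSet"]["ResourceRecords"])
```

Because the rollback is the same request with the blue endpoint, switching back is as cheap as switching forward, which is what makes the extra infrastructure cost worthwhile for low-blast-radius upgrades.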

Conclusion

The pace of change in modern clouds is only increasing, and modern platform engineering and IT operations teams will not be able to scale to meet these challenges without heavy use of automation tooling. The EKS upgrade process is a great example of something that seems routine but gets exponentially complicated as the size of your cloud (and its user base) grows.

Rafay provides a turnkey cloud automation solution in a centralized, easy-to-use SaaS platform that bridges these complexity gaps, so your business can focus on rapid innovation rather than cloud management. Rafay’s solution allows platform teams to rapidly and safely enable operations, developers, data scientists, engineers, and other cloud stakeholders to move faster.

If you’re interested in a demonstration of our EKS upgrade capabilities, or any of the other automations our platform provides, please reach out!

Author

Anirban Chatterjee
