etcd & Kubernetes: What You Should Know

Kubernetes is architected as a set of microservices that manage the lifecycle of containers and coordinate application management tasks such as configuration, deployment, service discovery, load balancing, scheduling, scaling, and monitoring across the nodes of a cluster. The microservices-based architecture of the Kubernetes control plane offers the flexibility and resiliency to scale up and down according to the demands of the workloads. However, a distributed architecture like this needs a reliable and performant data store that acts as a single source of truth, and that data store is etcd.

Kubernetes Components (Image source: Kubernetes.io)

Etcd is a graduated Cloud Native Computing Foundation (CNCF) open source project that provides a distributed, reliable and highly available key-value store. Written in Go, etcd takes its name from a UNIX naming convention: system configuration files are stored in a directory called “etc”. The “d” appended to “etc” stands for “distributed”. Etcd is an integral part of the Kubernetes control plane.

Etcd stores Kubernetes cluster configuration and state data such as the number of pods, their state, namespace, etc. It also stores Kubernetes API objects and service discovery details.
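
To make this concrete, the sketch below shows how an object's identity might map to an etcd key. It assumes the API server's default "/registry" key prefix and a "/registry/<resource>/<namespace>/<name>" layout. The helper name registry_key and the object names are hypothetical, and the exact layout is an implementation detail that can differ by resource type and Kubernetes version.

```python
from typing import Optional


def registry_key(resource: str, name: str,
                 namespace: Optional[str] = None,
                 prefix: str = "/registry") -> str:
    """Build an illustrative etcd key for a Kubernetes object.

    Assumption: objects live under <prefix>/<resource>/[<namespace>/]<name>.
    Cluster-scoped objects (such as namespaces) have no namespace segment.
    """
    parts = [prefix.rstrip("/"), resource]
    if namespace is not None:
        parts.append(namespace)
    parts.append(name)
    return "/".join(parts)


# Hypothetical objects, used only to show the shape of the keys.
print(registry_key("pods", "web-0", namespace="default"))
print(registry_key("namespaces", "default"))
```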

So what makes etcd the control plane data store of choice for Kubernetes? It’s because it has the following key qualities:

Fully Replicated: Every node in an etcd cluster has access to the complete data store and is therefore capable of becoming the primary data source at any moment.
Consistent: Every read from etcd returns the latest committed data, regardless of which node in the cluster serves the request.
Highly Available: etcd is deployed in a highly available fashion with an odd number of nodes, three or more. This ensures that there is no single point of failure due to network connectivity issues, power failures, hardware issues, unplanned maintenance, etc.
Speed: etcd has been tested and benchmarked at 10,000 writes per second.
Secure: etcd supports Transport Layer Security (TLS), the successor to Secure Sockets Layer (SSL), with optional client certificate authentication.

Etcd is built on the Raft consensus algorithm, which keeps the data store consistent across all the nodes. Raft divides the nodes in the cluster into one Leader and a set of Followers. The elected Leader node manages data replication to all of the Follower nodes in the cluster, and a write is committed once a majority of nodes has acknowledged it. See here for more details on how Raft works and handles failure scenarios.
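
Raft needs a majority (a quorum) of members to agree before a change is committed, which is why odd cluster sizes are recommended. The following minimal sketch works through that arithmetic. The function names are illustrative and not part of any etcd API.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can fail while a majority is still available."""
    return members - quorum_size(members)


# Print cluster size, quorum size, and tolerated failures side by side.
for n in range(1, 8):
    print(f"members={n} quorum={quorum_size(n)} tolerates={failure_tolerance(n)}")
```

Going from three to four members raises the quorum from two to three but still tolerates only one failure, so the fourth node adds cost without adding resilience.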

Design and Deployment Considerations for etcd

Etcd, without a doubt, is a critical and core component of the Kubernetes control plane. Here are some key design and deployment considerations for etcd:

High Availability

Etcd is inherently a highly available system, with multiple nodes forming a quorum as discussed above. However, when you deploy Kubernetes components, including etcd, you need to place the components on different physical or virtual nodes to avoid single points of failure at the infrastructure level. There are two ways in which you can do this:

Stacked etcd: In stacked etcd mode, etcd is co-located with the other Kubernetes control plane components on the same nodes as shown below. The advantage of stacked etcd mode is that all the Kubernetes control plane components can be deployed on as few as three nodes.

Stacked etcd Deployment (Image source: Kubernetes.io)

External etcd: In external etcd mode, etcd is deployed on a separate set of nodes from the Kubernetes control plane components as shown below. The advantage of external etcd mode is that you can use dedicated data backup and restore strategies separate from the Kubernetes control plane component nodes. However, the downside is that you would need three extra physical or virtual nodes to run etcd separately.

External etcd Deployment (Image source: Kubernetes.io)

Placement

It is prudent to deploy etcd nodes across different availability zones in a public cloud or across different data center locations to guard against power failures, natural disasters and similar incidents.
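
Spreading members across zones only helps if the cluster keeps a majority when any one zone is lost. The sketch below checks a proposed placement against that condition. The function name and zone labels are hypothetical, and the check ignores other factors such as inter-zone latency.

```python
from collections import Counter
from typing import Iterable


def survives_any_zone_loss(member_zones: Iterable[str]) -> bool:
    """Return True if losing any single zone still leaves a majority.

    member_zones holds one zone label per etcd member.
    """
    zones = Counter(member_zones)
    total = sum(zones.values())
    if total == 0:
        raise ValueError("no members given")
    quorum = total // 2 + 1
    # After losing a zone, the remaining members must still reach quorum.
    return all(total - count >= quorum for count in zones.values())


print(survives_any_zone_loss(["zone-a", "zone-b", "zone-c"]))  # one member per zone
print(survives_any_zone_loss(["zone-a", "zone-a", "zone-b"]))  # two members share a zone
```

With three members, placing two of them in the same zone means that losing that zone also loses the quorum.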

Latency and Throughput

Etcd is a distributed system, so it relies on the underlying network and storage infrastructure to perform at acceptable levels of latency and throughput. Latency is the time taken to complete an operation. Throughput is the total number of operations completed within a time period. Check out this article to learn more about etcd timing and other parameters and how to tune them. See this article to learn more about etcd benchmarking and how you can benchmark etcd in your specific environment using the benchmark tool that is included with the etcd package. For storage specifically, it is recommended to use high-speed storage such as SSDs and NVMe drives to store etcd data.
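
To illustrate how the two metrics relate, the sketch below times a stand-in operation and derives throughput, mean latency, and 99th percentile latency from the samples. It does not talk to etcd. The function name measure and the placeholder workload are assumptions made for illustration, and a real measurement should use etcd's own benchmark tool against your environment.

```python
import statistics
import time
from typing import Callable, Dict


def measure(operation: Callable[[], None], count: int = 1000) -> Dict[str, float]:
    """Run operation count times (count >= 2) and summarize the timings."""
    latencies = []
    start = time.perf_counter()
    for _ in range(count):
        t0 = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        # Throughput: operations completed per second of wall-clock time.
        "throughput_ops_per_s": count / elapsed,
        # Latency: time taken by each individual operation.
        "mean_latency_s": statistics.mean(latencies),
        "p99_latency_s": statistics.quantiles(latencies, n=100)[98],
    }


# Placeholder workload standing in for a write request.
print(measure(lambda: sum(range(1000))))
```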

Security

Among many other things, Kubernetes secrets are stored in the cluster’s etcd database. Etcd therefore has to be secured, since it can be a prime target for attackers. The following are the key security considerations when deploying etcd:

Secure Storage: Secure the etcd storage system by ensuring that the entire disk underlying etcd is encrypted. This also makes it operationally easier to dispose of the disks when they are no longer useful. The secrets stored in etcd should always be strongly encrypted when written. Implement strong key management for the symmetric encryption keys.
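
One way to encrypt secrets before they are written to etcd is the Kubernetes API server's encryption-at-rest configuration. The sketch below generates such a configuration as JSON, which is also valid YAML, and the resulting file would be passed to the API server through its --encryption-provider-config flag. The choice of the aescbc provider, the key name key1, and the function name are assumptions for illustration. Check the provider options for your Kubernetes version before using anything like this.

```python
import base64
import json
import secrets


def encryption_config(key_name: str = "key1") -> dict:
    """Build an illustrative EncryptionConfiguration for Secrets."""
    # 32 random bytes, base64-encoded, used as the symmetric key.
    key = base64.b64encode(secrets.token_bytes(32)).decode("ascii")
    return {
        "apiVersion": "apiserver.config.k8s.io/v1",
        "kind": "EncryptionConfiguration",
        "resources": [{
            "resources": ["secrets"],
            "providers": [
                # The first provider is used to encrypt new writes.
                {"aescbc": {"keys": [{"name": key_name, "secret": key}]}},
                # identity allows reading data that is still unencrypted.
                {"identity": {}},
            ],
        }],
    }


print(json.dumps(encryption_config(), indent=2))
```

Printing the key is only acceptable in a demonstration. In practice the key material should be generated and held by whatever key management process you use.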

Secure Access: Secure the system by ensuring that etcd is configured to require mutually authenticated TLS for access by clients (see below).

Etcd Access with mutually authenticated TLS

Authorization: In multi-master Kubernetes cluster deployments, the etcd servers listen on all interfaces. Therefore, limiting who or what can access etcd is a critical security control. Secure the system by ensuring that access to etcd is restricted to specific clients only, i.e., only the API server is allowed to connect to etcd. Mandate the use of strong certificate-based mutual authentication for access. Use a different Certificate Authority (CA) for protecting access to etcd from the one used for Kubernetes. This denies access to the etcd cluster from Kubernetes components other than the API server.

Synchronization: In a high availability configuration, secure the system by ensuring that mutually authenticated TLS is required for all etcd “peer-to-peer” communication (see below).

Mutually authenticated TLS for all etcd “peer-to-peer” communication
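
As a rough sketch of what this looks like in practice, the function below assembles the etcd server flags that turn on certificate verification for both client and peer connections. The flag names are etcd's transport security options. The certificate directory, file names, and function name are hypothetical.

```python
from typing import List


def etcd_tls_flags(cert_dir: str = "/etc/etcd/pki") -> List[str]:
    """Return etcd flags requiring mutually authenticated TLS.

    cert_dir and the file names under it are placeholder paths.
    """
    return [
        # Client-to-server traffic (for example, from the API server).
        f"--cert-file={cert_dir}/server.crt",
        f"--key-file={cert_dir}/server.key",
        f"--trusted-ca-file={cert_dir}/ca.crt",
        "--client-cert-auth=true",
        # Peer-to-peer traffic between etcd members.
        f"--peer-cert-file={cert_dir}/peer.crt",
        f"--peer-key-file={cert_dir}/peer.key",
        f"--peer-trusted-ca-file={cert_dir}/ca.crt",
        "--peer-client-cert-auth=true",
    ]


print(" \\\n  ".join(["etcd"] + etcd_tls_flags()))
```

Pointing --trusted-ca-file at a CA that is used only for etcd is one way to apply the separate-CA recommendation above.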

Role-Based Access Controls: Since etcd stores vital and highly sensitive configuration data, DevOps teams should implement role-based access controls within the deployment and ensure that team members interacting with etcd are limited to the least-privileged level of access necessary to perform their jobs.

Backup and Restore

Implement a resilient data backup mechanism that regularly backs up etcd data, so that you can recover quickly after a failure or data loss. You can use a volume backup solution to back up the etcd volumes, use a Kubernetes cluster data backup mechanism as described here and rebuild the etcd database, or use any other backup mechanism that best suits your environment.
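
Another common option is etcd's built-in snapshot feature. The sketch below only builds an etcdctl snapshot save command with a timestamped file name and does not run it. The endpoint, certificate paths, backup directory, and function name are assumptions to replace with values from your own environment.

```python
from datetime import datetime, timezone
from typing import List, Optional


def snapshot_save_command(backup_dir: str = "/var/backups/etcd",
                          endpoint: str = "https://127.0.0.1:2379",
                          cert_dir: str = "/etc/etcd/pki",
                          now: Optional[datetime] = None) -> List[str]:
    """Build (but do not run) an etcdctl snapshot command."""
    now = now or datetime.now(timezone.utc)
    snapshot = f"{backup_dir}/etcd-{now:%Y%m%dT%H%M%SZ}.db"
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        # Client certificate and CA for mutually authenticated TLS.
        f"--cacert={cert_dir}/ca.crt",
        f"--cert={cert_dir}/client.crt",
        f"--key={cert_dir}/client.key",
        "snapshot", "save", snapshot,
    ]


print(" ".join(snapshot_save_command()))
```

A scheduler such as cron could run the resulting command at a regular interval, and the snapshot files should be copied off the etcd nodes and test-restored periodically.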

Etcd: The Custodian of the Kubernetes Control Plane Data. But, There is More…

Etcd plays a crucial role in maintaining and storing Kubernetes cluster configuration, state, Kubernetes API objects and service discovery details. It is essential to take extra care to provide the right infrastructure to run etcd with sound availability, security and backup choices. Be sure to guard against vulnerabilities by patching etcd and running the latest battle-tested version in your environments.

Throughout this blog, we discussed the essential aspects of etcd, a single but very crucial component of Kubernetes. If you are planning to run, or are already running, a fleet of Kubernetes clusters, then you should also be thinking about the heavy lifting required to manage the full lifecycle of those clusters, including etcd, along with your preferred add-ons for security, secrets management, storage, backup, load balancing, and other components. To streamline that, take a look at the Rafay Kubernetes Operations Platform.

To learn more about etcd, Kubernetes and how to manage the lifecycle of clusters, check out these additional educational resources:

Etcd: Options for high availability topology at Kubernetes.io
Kubernetes.io – Operating etcd clusters for Kubernetes
Best practices checklist for getting started with Kubernetes including high availability and security for etcd
See how Verizon, SonicWall, and Minim use Rafay Kubernetes Management Cloud to manage their distributed global fleet of Kubernetes clusters in the cloud, data center and at the edge

Author

Naren Narendra
