Architecture and Design Considerations for Platform Engineering Teams

Platform engineering is not a new concept; it has long existed at large companies such as Google, Amazon, Facebook and Netflix. For any large-scale product-engineering team, a platform is a set of standard services, frameworks and patterns that one or more teams originally developed for their own use and that other teams in the organization can leverage.

The rest of the engineering organization uses these platform services either to develop other applications and services or as internal tools.

Product teams used to build many of these shared services and tools internally because equivalent open source or commercial frameworks, platforms and tools were not yet available. A good example is Google’s internal container management platform Borg, which inspired the open source Kubernetes project.

Another good example is Kafka, an open-source messaging platform that LinkedIn originally built for internal use. Similarly, S3 and EC2 from Amazon Web Services (AWS), which were initially developed for internal use, became the core foundation of the AWS public cloud.

Thanks to the availability of these platforms, frameworks, and tools either as open-source or commercial products, platform teams no longer need to build them for application development or self-service cloud infrastructure.

What Is a Platform?

So what exactly is a platform in the context of cloud native application development, deployment and management? Is it an internal developer platform, a developer self-service portal, a developer-experience tool or simply a developer onboarding tool? Are there users other than developers who use the platform?

The answer seems to be “yes” to all of these questions. The platform that we refer to now is a combination of all the above and more. Since most of the foundational services are available as open source products, commercial products or both, a platform engineering team’s main goal is to make these services and tools easily discoverable, readily available in a self-serve manner and easier to use through standard interfaces such as APIs, UIs, self-serve portals and Terraform.

For example, platform teams may provide a cluster or container as a service to their end users so that each business unit or application team does not have to provision or manage Kubernetes infrastructure. Another example is application deployment as a service, where platform teams automate the application deployment process by providing tools such as Argo CD as a service.

Under the hood, platform teams may be leveraging various commercial or open source frameworks with some custom automation. Though developers are the primary internal users of the platform, other teams such as SRE, security, product support and FinOps can also immensely benefit from the platform.

To be ultimately successful, platform teams must address not only their developer use cases, but also the use cases of their other internal teams.

Such a platform may have one or more user interfaces, which developers and other internal users employ to easily consume these services in a self-serve manner with minimal assistance from the platform team.

The user interface may be different for different user personas. For example, developers may use Backstage, an open source framework for building internal developer portals, as a self-service portal for accessing all their development resources such as catalogs, templates, deployment pipelines and development/test environments. Other teams, such as SRE, can use Terraform for infrastructure management and maintenance.

Behind the user interface is the platform’s backend, which brings together all the organization’s common frameworks, infrastructure, services and tools, and provides them as standard services to their end users via one or more user interfaces.

The organization’s security, governance, and compliance requirements are also baked into the backend to apply across all platform services so that the requirements are enforced consistently across the organization.

Platform Architecture

In its simplest form, the platform may be seen as two components (as shown below): a frontend comprising one or more end-user interfaces, and a backend that provides the necessary infrastructure, services and tooling automation to the frontend. Together, they enable end users to use these capabilities in a self-serve manner for better productivity, accelerated product development and consistent security and governance policy control.

The frontend for developers can be a simple homegrown portal, an advanced Backstage deployment or a commercial internal developer platform (IDP) solution. Similarly, for SRE teams, the frontend may consist of a set of common Terraform modules developed by the platform team for provisioning and managing the infrastructure.

For some teams, the frontend might be a set of declarative specifications of infrastructure resources that are checked into a Git repository, from which the resources are automatically provisioned and managed via GitOps.
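
The GitOps flow described above boils down to comparing what is declared in the repository with what is actually provisioned, then acting on the difference. The following is a minimal Python sketch of that comparison step only. The `plan_changes` function, the resource names and the spec fields are all hypothetical; real GitOps tools handle far more (ordering, drift detection, rollbacks).

```python
from typing import Dict, List, Tuple

# Hypothetical resource spec: resource name -> configuration dictionary.
Spec = Dict[str, dict]


def plan_changes(desired: Spec, actual: Spec) -> List[Tuple[str, str]]:
    """Return (action, resource_name) pairs that move `actual` toward `desired`."""
    plan: List[Tuple[str, str]] = []
    # Anything declared but missing is created; anything that differs is updated.
    for name in sorted(desired):
        if name not in actual:
            plan.append(("create", name))
        elif actual[name] != desired[name]:
            plan.append(("update", name))
    # Anything provisioned but no longer declared is deleted.
    for name in sorted(actual):
        if name not in desired:
            plan.append(("delete", name))
    return plan


# Declared state, as it might be checked into a Git repository (illustrative values).
desired_state = {
    "dev-cluster": {"kind": "KubernetesCluster", "nodes": 3},
    "app-bucket": {"kind": "ObjectStore", "versioning": True},
}
# State currently provisioned (illustrative values).
actual_state = {
    "dev-cluster": {"kind": "KubernetesCluster", "nodes": 2},
    "old-queue": {"kind": "MessageQueue"},
}

changes = plan_changes(desired_state, actual_state)
print(changes)
```

In this sketch the repository is the single source of truth: a team changes infrastructure by editing the declared state, and the automation works out what to create, update or delete.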

The backend is essentially a collection of infrastructure automation, application services, developer-experience tools, SRE tools and frameworks, and security and governance policy-management tools. Platform teams typically build this backend by adding an automation layer on top of this infrastructure and these services and tools to make them available via the various frontends.

For example, this may mean developing a custom plugin that lets developers create a developer sandbox from a Backstage portal. Similarly, it may mean writing a Terraform module for creating Kubernetes clusters with all the required add-ons and policies, which the SRE/operations team can use to create clusters with consistent configuration. Each of the major components of a platform’s backend is described below:

Infrastructure

This component provides the automation required to provision and orchestrate public and private cloud resources. The automation may range from provisioning basic infrastructure resources, such as virtual private clouds, identity and access management roles and load balancers, to provisioning complex resources such as Kubernetes clusters and complete environments. Platform teams commonly use Infrastructure-as-Code and GitOps practices for this automation.
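
As an illustration of this kind of automation, the sketch below shows one way a platform team might guarantee that every cluster request ends up with the same required add-ons and policies, in the spirit of the cluster module mentioned earlier. It is written in Python rather than Terraform purely for illustration. The add-on names, policy names and the `build_cluster_spec` helper are assumptions, not part of any real tool.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

# Assumed platform-mandated baseline; these names are illustrative only.
REQUIRED_ADDONS = ["ingress-controller", "log-collector", "policy-agent"]
REQUIRED_POLICIES = ["deny-privileged-containers", "require-cost-labels"]


@dataclass
class ClusterSpec:
    name: str
    node_count: int
    addons: List[str] = field(default_factory=list)
    policies: List[str] = field(default_factory=list)


def build_cluster_spec(
    name: str, node_count: int, extra_addons: Sequence[str] = ()
) -> ClusterSpec:
    """Combine a team's request with the baseline every cluster must carry."""
    # Required add-ons always come first and cannot be removed by the caller.
    addons = list(REQUIRED_ADDONS)
    for addon in extra_addons:
        if addon not in addons:  # skip duplicates of required add-ons
            addons.append(addon)
    return ClusterSpec(
        name=name,
        node_count=node_count,
        addons=addons,
        policies=list(REQUIRED_POLICIES),
    )


spec = build_cluster_spec("payments-dev", 3, ["gpu-driver", "log-collector"])
print(spec.addons)
print(spec.policies)
```

The point of the design is that callers can only add to the baseline; they have no parameter that removes a required add-on or policy, which is what keeps cluster configuration consistent.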

Services

Every application team uses a variety of services and tools in their application development that are not part of the core application. These may range from basic services such as container registry, CI/CD pipelines and Vault as a service for secret management, to advanced services like messaging, caching, data backup, disaster recovery, etc.

Platform teams automate these services and provide them to application teams through an interface that is easy to onboard to and integrate with. This reduces toil and cognitive load on the application teams and lets them focus on core application development for accelerated product delivery.

Observability

Observability involves collecting data from running systems that can be used for troubleshooting and fixing issues, analyzing resource usage for performance optimization, collecting metrics for capacity planning, or building early warning systems that detect potential problems before they occur.

Logging, metrics and tracing are the essential components of the observability stack. Platform teams typically use open source and/or commercial solutions, and may implement additional automation to integrate them seamlessly with various applications for data collection. They then provide the collected data to developers and SRE/Ops teams for analysis and troubleshooting.
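
To make the early-warning idea concrete, here is a minimal Python sketch that extrapolates a resource-usage metric to estimate when it will hit capacity. The function names, the linear-growth assumption and the sample values are all assumptions made for illustration; production systems usually rely on the alerting features of their observability stack.

```python
from typing import Optional, Sequence


def intervals_until_full(
    usage_percent: Sequence[float], capacity: float = 100.0
) -> Optional[float]:
    """Estimate how many sampling intervals remain before usage reaches capacity.

    Uses the average growth between the first and last sample. Returns None
    when there are fewer than two samples or usage is flat or shrinking.
    """
    if len(usage_percent) < 2:
        return None
    growth = (usage_percent[-1] - usage_percent[0]) / (len(usage_percent) - 1)
    if growth <= 0:
        return None
    return (capacity - usage_percent[-1]) / growth


def should_warn(usage_percent: Sequence[float], horizon: float) -> bool:
    """Warn when capacity is projected to be reached within `horizon` intervals."""
    remaining = intervals_until_full(usage_percent)
    return remaining is not None and remaining <= horizon


# Disk usage sampled at equal intervals (illustrative values).
samples = [60.0, 64.0, 68.0, 72.0]
print(intervals_until_full(samples))
print(should_warn(samples, horizon=10))
```

With these samples, usage grows by 4 points per interval and 28 points remain, so the estimate is 7 intervals and a 10-interval horizon triggers a warning.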

SRE

Apart from the observability tools, SRE and operations teams also use many other tools and technologies to manage and operate large-scale application infrastructure. These may include automation for fleet infrastructure management and operations, chaos engineering, incident management, alert management, custom troubleshooting tools for advanced debugging, and self-healing.

Platform teams may standardize on open source and commercial products for some of these tools and make them available to the SRE teams. For use cases such as fleet management, advanced debugging and self-healing, platform teams may instead develop custom solutions, as these use cases can be very specific to their infrastructure and applications.

Developer Experience (DevEx)

Developers are often forced to repeat the same work whenever they start developing a new service or application. This is especially prevalent in large organizations with many internal application teams, product teams and business units, where there is often little sharing of code and tools across teams. These repetitive tasks may include recreating boilerplate code for a new service when equivalent templates already exist elsewhere, setting up a testbed, and spinning up dev/test environments.

Apart from this, developers also need information about the services they own: what resources each service uses, how healthy it is, when it was last changed and how to view its latest logs. Platform teams can provide this information through a unified developer portal built on Backstage or another developer portal, along with the capabilities for automating the repetitive tasks described above.
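
As a rough sketch of what such a portal might store for each service, the Python example below models a catalog record that answers the four questions above. The `ServiceRecord` class, its fields and the example URL are hypothetical and are not Backstage’s actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class ServiceRecord:
    """Hypothetical catalog entry for one service owned by a team."""

    name: str
    owner_team: str
    resources: List[str]      # what the service uses
    healthy: bool             # current health status
    last_changed: datetime    # when it was last changed
    logs_url: str             # where to view the latest logs

    def summary(self) -> str:
        status = "healthy" if self.healthy else "unhealthy"
        return (
            f"{self.name} (owner: {self.owner_team}) is {status}; "
            f"uses {', '.join(self.resources)}; "
            f"last changed {self.last_changed:%Y-%m-%d}; "
            f"logs: {self.logs_url}"
        )


record = ServiceRecord(
    name="checkout",
    owner_team="payments",
    resources=["postgres-db", "redis-cache"],
    healthy=True,
    last_changed=datetime(2024, 1, 15, tzinfo=timezone.utc),
    logs_url="https://logs.example.com/checkout",  # placeholder URL
)
print(record.summary())
```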

Security and Governance

InfoSec and security teams define a security framework and a baseline security posture for all the components, services and infrastructure the entire organization uses. This may entail enforcement of all the security policies and practices consistently across all the systems to meet the security baseline posture, continuous validation of the baseline to detect any deviation and quick remediation in the event of violations.

The security baseline typically covers single sign-on and role-based access controls, network security, granular compliance and security policies at the resource level implemented with Open Policy Agent (OPA), image scanning for vulnerabilities, runtime container security and CIS benchmark tests.
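
The idea of validating every resource against a baseline can be sketched as a set of named checks applied to a workload description. The Python example below is only an illustration of that pattern. It is not OPA or its Rego policy language, and the workload fields, policy names and registry address are assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical workload description; the field names are assumptions.
Workload = Dict[str, object]

# Assumed list of registries approved by the security team.
APPROVED_REGISTRIES = ("registry.internal.example.com/",)


def no_privileged_containers(workload: Workload) -> bool:
    return not workload.get("privileged", False)


def image_from_approved_registry(workload: Workload) -> bool:
    image = str(workload.get("image", ""))
    return image.startswith(APPROVED_REGISTRIES)


def image_scan_passed(workload: Workload) -> bool:
    # A missing scan result is treated as a failure, not a pass.
    return workload.get("critical_vulnerabilities", 1) == 0


BASELINE_POLICIES: Dict[str, Callable[[Workload], bool]] = {
    "no-privileged-containers": no_privileged_containers,
    "approved-registry": image_from_approved_registry,
    "image-scan-passed": image_scan_passed,
}


def evaluate(workload: Workload) -> List[str]:
    """Return the names of baseline policies the workload violates."""
    return [name for name, check in BASELINE_POLICIES.items() if not check(workload)]


compliant = {
    "image": "registry.internal.example.com/payments:1.4",
    "privileged": False,
    "critical_vulnerabilities": 0,
}
violating = {"image": "docker.io/library/nginx:latest", "privileged": True}

print(evaluate(compliant))
print(evaluate(violating))
```

Running the same `evaluate` function at build time, at deployment time and periodically against running workloads is one way to get the continuous validation described above.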

For governance, cost-control policies are essential for every organization. Platform teams need to apply these cost-control policies automatically at every stage of application development, deployment and management, and also at the infrastructure level. For instance, this may mean automatically installing a set of approved system add-ons, OPA policies, network policies and cost-control policies as part of every Kubernetes cluster deployment.
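
A cost-control policy applied at provisioning time might look like the following Python sketch, which rejects cluster requests that lack cost-attribution labels or exceed a budget. The label names, the per-node price, the budget figure and the `cost_violations` helper are illustrative assumptions, not real pricing or a real API.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed governance rule: every cluster must carry these labels.
REQUIRED_COST_LABELS = ("team", "cost-center")


@dataclass
class ClusterRequest:
    name: str
    node_count: int
    hourly_cost_per_node: float  # would come from a pricing source in practice
    labels: Dict[str, str]


def cost_violations(request: ClusterRequest, monthly_budget: float) -> List[str]:
    """Return human-readable cost-policy violations for a cluster request."""
    violations: List[str] = []
    for label in REQUIRED_COST_LABELS:
        if not request.labels.get(label):
            violations.append(f"missing label: {label}")
    # Rough estimate assuming nodes run continuously for a 30-day month.
    estimated = request.node_count * request.hourly_cost_per_node * 24 * 30
    if estimated > monthly_budget:
        violations.append(
            f"estimated monthly cost {estimated:.2f} exceeds budget {monthly_budget:.2f}"
        )
    return violations


request = ClusterRequest(
    name="analytics-dev",
    node_count=4,
    hourly_cost_per_node=0.25,  # illustrative price, not a real quote
    labels={"team": "analytics"},
)
print(cost_violations(request, monthly_budget=500.0))
```

Here the request fails on both counts: the `cost-center` label is missing, and four nodes at the assumed price come to 720.00 per month against a budget of 500.00.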

Conclusion

There is no one-size-fits-all approach to platform engineering. It boils down to each organization’s specific requirements, priorities and what it wants to accomplish via a platform. The platform is not just an IDP, a Backstage deployment or a self-serve portal, and developers are not necessarily its only users. Platform teams must thoroughly understand all of their internal user personas and their needs, and develop the right kind of backend and user interfaces to deliver maximum value to all their internal users.

Author

Hemanth Kavuluru
