Azure Well-Architected Framework review - Azure Kubernetes Service (AKS)

This article provides architectural best practices for Azure Kubernetes Service (AKS). The guidance is based on the five pillars of architectural excellence:

  • Reliability
  • Security
  • Cost optimization
  • Operational excellence
  • Performance efficiency

We assume that you understand system design principles, have working knowledge of Azure Kubernetes Service, and are well versed with its features. For more information, see Azure Kubernetes Service.

Prerequisites

Understanding the Well-Architected Framework pillars can help produce a high-quality, stable, and efficient cloud architecture. We recommend that you review your workload by using the Azure Well-Architected Framework Review assessment.

For context, consider reviewing a reference architecture that reflects these considerations in its design. We recommend that you start with the baseline architecture for an Azure Kubernetes Service (AKS) cluster and Microservices architecture on Azure Kubernetes Service. Also review the AKS landing zone accelerator, which provides an architectural approach and reference implementation to prepare landing zone subscriptions for a scalable Azure Kubernetes Service (AKS) cluster.

Reliability

In the cloud, we acknowledge that failures happen. Instead of trying to prevent failures altogether, the goal is to minimize the effects of a single failing component. Use the following information to minimize the impact of failed components.

When discussing reliability with Azure Kubernetes Service, it's important to distinguish between cluster reliability and workload reliability. Cluster reliability is a shared responsibility between the cluster admin and their resource provider, while workload reliability is the domain of a developer. Azure Kubernetes Service has considerations and recommendations for both of these roles.

In the design checklist and list of recommendations below, call-outs are made to indicate whether each choice is applicable to cluster architecture, workload architecture, or both.

Design checklist

  • Cluster architecture: For critical workloads, use availability zones for your AKS clusters.
  • Cluster architecture: Plan the IP address space to ensure your cluster can reliably scale, including handling of failover traffic in multi-cluster topologies.
  • Cluster architecture: Enable Container insights to monitor your cluster and configure alerts for reliability-impacting events.
  • Workload architecture: Ensure workloads are built to support horizontal scaling and report application readiness and health.
  • Cluster and workload architectures: Ensure your workload is running on user node pools and choose the right size SKU. At a minimum, include two nodes for user node pools and three nodes for the system node pool.
  • Cluster architecture: Use the AKS Uptime SLA to meet availability targets for production workloads.
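The availability-zone guidance above carries into the workload layer: once node pools span zones, topology spread constraints keep replicas from concentrating in a single zone. A minimal sketch, assuming a hypothetical `web-frontend` Deployment (the name, image, and replica count are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend          # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      # Spread replicas across availability zones; maxSkew: 1 keeps
      # per-zone replica counts within one of each other.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-frontend
      containers:
        - name: web
          image: myregistry.azurecr.io/web-frontend:1.0.0   # placeholder image
```

With `whenUnsatisfiable: ScheduleAnyway`, the constraint is a soft preference; use `DoNotSchedule` if strict zone balance matters more than schedulability.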

AKS configuration recommendations

Explore the following table of recommendations to optimize your AKS configuration for Reliability.

Recommendation Benefit
Cluster and workload architectures: Control pod scheduling using node selectors and affinity. Allows the Kubernetes scheduler to logically isolate workloads by hardware in the node. Unlike tolerations, pods without a matching node selector can be scheduled on labeled nodes, which allows unused resources on the nodes to be consumed, but gives priority to pods that define the matching node selector. Use node affinity for more flexibility, which allows you to define what happens if the pod can't be matched with a node.
Cluster architecture: Ensure proper selection of network plugin based on network requirements and cluster sizing. Azure CNI is required for specific scenarios, for example, Windows-based node pools, specific networking requirements and Kubernetes Network Policies. Reference Kubenet versus Azure CNI for more information.
Cluster and workload architectures: Use the AKS Uptime SLA for production grade clusters. The AKS Uptime SLA guarantees:
- 99.95% availability of the Kubernetes API server endpoint for AKS Clusters that use Azure Availability Zones, or
- 99.9% availability for AKS Clusters that don't use Azure Availability Zones.
Cluster and workload architectures: Configure monitoring of cluster with Container insights. Container insights helps monitor the health and performance of controllers, nodes, and containers that are available in Kubernetes through the Metrics API. Integration with Prometheus enables collection of application and workload metrics.
Cluster architecture: Use availability zones to maximize resilience within an Azure region by distributing AKS agent nodes across physically separate data centers. By spreading node pools across multiple zones, nodes in one node pool will continue running even if another zone has gone down. If colocality requirements exist, either a regular VMSS-based AKS deployment into a single zone or proximity placement groups can be used to minimize internode latency.
Cluster architecture: Adopt a multiregion strategy by deploying AKS clusters across different Azure regions to maximize availability and provide business continuity. Internet-facing workloads should leverage Azure Front Door or Azure Traffic Manager to route traffic globally across AKS clusters.
Cluster and workload architectures: Define Pod resource requests and limits in application deployment manifests, and enforce with Azure Policy. Container CPU and memory resource limits are necessary to prevent resource exhaustion in your Kubernetes cluster.
Cluster and workload architectures: Keep the System node pool isolated from application workloads. System node pools require a VM SKU of at least 2 vCPUs and 4GB memory, but 4 vCPU or more is recommended. Reference System and user node pools for detailed requirements.
Cluster and workload architectures: Separate applications to dedicated node pools based on specific requirements. Applications may share the same configuration and need GPU-enabled VMs, CPU- or memory-optimized VMs, or the ability to scale to zero. Avoid a large number of node pools to reduce extra management overhead.
Cluster architecture: Use a NAT gateway for clusters that run workloads that make many concurrent outbound connections. To avoid reliability issues caused by Azure Load Balancer limitations with high concurrent outbound traffic, use a NAT gateway instead to support reliable egress traffic at scale.
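Two of the rows above, pod scheduling control and resource requests and limits, meet in the pod template. A hedged sketch, assuming a hypothetical `hardware=highmem` node label and illustrative sizing values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker          # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      affinity:
        nodeAffinity:
          # Hard requirement: schedule only on nodes labeled hardware=highmem.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: hardware          # hypothetical node label
                    operator: In
                    values: ["highmem"]
      containers:
        - name: worker
          image: myregistry.azurecr.io/batch-worker:1.0.0   # placeholder image
          resources:
            requests:         # what the scheduler reserves for the pod
              cpu: 250m
              memory: 256Mi
            limits:           # hard ceiling enforced at runtime
              cpu: "1"
              memory: 512Mi
```

Swap `requiredDuringScheduling...` for `preferredDuringScheduling...` when the hardware match should be a preference rather than a hard constraint.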

For more suggestions, see Principles of the reliability pillar.

Azure Policy

Azure Kubernetes Service offers a wide variety of built-in Azure policies that apply both to the Azure resource, like typical Azure policies, and, through the Azure Policy add-on for Kubernetes, within the cluster. There are numerous policies, and key policies related to this pillar are summarized here. For a more detailed view, see built-in policy definitions for Kubernetes.

Cluster and workload architecture

  • Clusters have readiness or liveness health probes configured for your pod spec.
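A container spec fragment that would satisfy this policy might look like the following; the endpoint paths and port are hypothetical and depend on what the application exposes:

```yaml
containers:
  - name: app
    image: myregistry.azurecr.io/app:1.0.0   # placeholder image
    ports:
      - containerPort: 8080
    # Readiness gates traffic: the pod is removed from Service endpoints
    # while this probe fails.
    readinessProbe:
      httpGet:
        path: /healthz/ready    # hypothetical endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    # Liveness restarts the container when the application is wedged.
    livenessProbe:
      httpGet:
        path: /healthz/live     # hypothetical endpoint
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```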

In addition to the built-in Azure Policy definitions, custom policies can be created for both the AKS resource and for the Azure Policy add-on for Kubernetes. This allows you to add additional reliability constraints you'd like to enforce in your cluster and workload architecture.

Security

Security is one of the most important aspects of any architecture. To explore how AKS can bolster the security of your application workload, we recommend you review the Security design principles. If your Azure Kubernetes Service cluster needs to be designed to run a sensitive workload that meets the regulatory requirements of the Payment Card Industry Data Security Standard (PCI-DSS 3.2.1), review AKS regulated cluster for PCI-DSS 3.2.1.

To learn about DoD Impact Level 5 (IL5) support and requirements with AKS, review Azure Government IL5 isolation requirements.

When discussing security with Azure Kubernetes Service, it's important to distinguish between cluster security and workload security. Cluster security is a shared responsibility between the cluster admin and their resource provider, while workload security is the domain of a developer. Azure Kubernetes Service has considerations and recommendations for both of these roles.

In the design checklist and list of recommendations below, call-outs are made to indicate whether each choice is applicable to cluster architecture, workload architecture, or both.

Design checklist

  • Cluster architecture: Use Managed Identities to avoid managing and rotating service principals.
  • Cluster architecture: Use Kubernetes role-based access control (RBAC) with Azure AD for least privilege access, and minimize granting administrator privileges to protect configuration and secrets access.
  • Cluster architecture: Use Microsoft Defender for Containers with Azure Sentinel to detect and quickly respond to threats across your clusters and the workloads running on them.
  • Cluster architecture: Deploy a private AKS cluster to ensure cluster management traffic to your API server remains on your private network. Or use the API server allow list for non-private clusters.
  • Workload architecture: Use a Web Application Firewall to secure HTTP(S) traffic.
  • Workload architecture: Ensure your CI/CD pipeline is hardened with container-aware scanning.

Recommendations

Explore the following table of recommendations to optimize your AKS configuration for security.

Recommendation Benefit
Cluster architecture: Use Azure Active Directory integration. Using Azure AD centralizes the identity management component. Any change in user account or group status is automatically updated in access to the AKS cluster. The developers and application owners of your Kubernetes cluster need access to different resources.
Cluster architecture: Authenticate with Azure Active Directory (Azure AD) to Azure Container Registry. AKS and Azure AD enable authentication with Azure Container Registry without the use of imagePullSecrets secrets. Review Authenticate with Azure Container Registry from Azure Kubernetes Service for more information.
Cluster architecture: Secure network traffic to your API server with private AKS cluster. By default, network traffic between your node pools and the API server travels the Microsoft backbone network; by using a private cluster, you can ensure network traffic to your API server remains on the private network only.
Cluster architecture: For non-private AKS clusters, use API server authorized IP ranges. When using public clusters, you can still limit the traffic that can reach your cluster's API server by using the authorized IP range feature. Include sources like the public IPs of your deployment build agents, operations management, and node pools' egress point (such as Azure Firewall).
Cluster architecture: Protect the API server with Azure Active Directory RBAC. Securing access to the Kubernetes API Server is one of the most important things you can do to secure your cluster. Integrate Kubernetes role-based access control (RBAC) with Azure AD to control access to the API server. Disable local accounts to enforce all cluster access using Azure AD-based identities.
Cluster architecture: Use Azure network policies or Calico. Secure and control network traffic between pods in a cluster.
Cluster architecture: Secure clusters and pods with Azure Policy. Azure Policy can help to apply at-scale enforcement and safeguards on your clusters in a centralized, consistent manner. It can also control what functions pods are granted and if anything is running against company policy.
Cluster architecture: Secure container access to resources. Limit access to actions that containers can perform. Provide the least number of permissions, and avoid the use of root or privileged escalation.
Workload architecture: Use a Web Application Firewall to secure HTTP(S) traffic. To scan incoming traffic for potential attacks, use a web application firewall such as Azure Web Application Firewall (WAF) on Azure Application Gateway or Azure Front Door.
Cluster architecture: Control cluster egress traffic. Ensure your cluster's outbound traffic passes through a network security point such as Azure Firewall or an HTTP proxy.
Cluster architecture: Use the open-source Azure AD Workload Identity and Secrets Store CSI Driver with Azure Key Vault. Protect and rotate secrets, certificates, and connection strings in Azure Key Vault with strong encryption. Provides an access audit log, and keeps core secrets out of the deployment pipeline.
Cluster architecture: Use Microsoft Defender for Containers. Monitor and maintain the security of your clusters, containers, and their applications.
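The "secure container access to resources" row above can be made concrete in the pod spec with a security context. A sketch under common least-privilege assumptions (the UID and the read-only root filesystem are illustrative; some applications need writable paths mounted separately):

```yaml
spec:
  securityContext:
    runAsNonRoot: true          # refuse to start containers running as root
    runAsUser: 1000             # illustrative non-root UID
  containers:
    - name: app
      image: myregistry.azurecr.io/app:1.0.0   # placeholder image
      securityContext:
        allowPrivilegeEscalation: false   # block setuid-style escalation
        readOnlyRootFilesystem: true      # immutable root filesystem
        capabilities:
          drop: ["ALL"]                   # start from zero Linux capabilities
```

The Azure Policy add-on's pod security initiatives can audit or deny pods that omit settings like these.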

For more suggestions, see Principles of the security pillar.

Azure Advisor helps ensure and improve the security of your Azure Kubernetes Service clusters. It makes recommendations on a subset of the items listed in the policy section below, such as clusters without RBAC configured, missing Microsoft Defender configuration, and unrestricted network access to the API server. Likewise, it makes workload recommendations for some of the pod security initiative items. Review the recommendations.

Policy definitions

Azure Policy offers a variety of built-in policy definitions that apply both to the Azure resource, like standard policy definitions, and, through the Azure Policy add-on for Kubernetes, within the cluster. Many of the Azure resource policies come in both Audit/Deny and Deploy If Not Exists variants.

There are numerous policies, and key policies related to this pillar are summarized here. For a more detailed view, see built-in policy definitions for Kubernetes.

Cluster architecture

  • Microsoft Defender for Cloud-based policies
  • Authentication mode and configuration policies (Azure AD, RBAC, disable local authentication)
  • API Server network access policies, including private cluster

Cluster and workload architecture

  • Kubernetes cluster pod security initiatives for Linux-based workloads
  • Pod and container capability policies, such as AppArmor, sysctl, security caps, SELinux, seccomp, privileged containers, and automounting cluster API credentials
  • Mount, volume driver, and filesystem policies
  • Pod and container networking policies, such as host network, port, allowed external IPs, HTTPS, and internal load balancers

Azure Kubernetes Service deployments often also use Azure Container Registry for Helm charts and container images. Azure Container Registry also supports a wide variety of Azure policies that spans network restrictions, access control, and Microsoft Defender for Cloud, which complements a secure AKS architecture.

In addition to the built-in policies, custom policies can be created for both the AKS resource and for the Azure Policy add-on for Kubernetes. This allows you to add additional security constraints you'd like to enforce in your cluster and workload architecture.

For more suggestions, see AKS security concepts and evaluate our security hardening recommendations based on the CIS Kubernetes benchmark.

Cost optimization

Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. We recommend you review the Cost optimization design principles.

When discussing cost optimization with Azure Kubernetes Service, it's important to distinguish between the cost of cluster resources and the cost of workload resources. Cluster resources are a shared responsibility between the cluster admin and their resource provider, while workload resources are the domain of a developer. Azure Kubernetes Service has considerations and recommendations for both of these roles.

In the design checklist and list of recommendations below, call-outs are made to indicate whether each choice is applicable to cluster architecture, workload architecture, or both.

For cluster cost optimization, go to the Azure pricing calculator and select Azure Kubernetes Service from the available products. You can test different configurations and payment plans in the calculator.

Design checklist

  • Cluster architecture: Use appropriate VM SKU per node pool and reserved instances where long-term capacity is expected.
  • Cluster and workload architectures: Use appropriate managed disk tier and size.
  • Cluster architecture: Review performance metrics, starting with CPU, memory, storage, and network, to identify cost optimization opportunities by cluster, nodes, and namespace.
  • Cluster architecture: Use cluster autoscaler to scale in when workloads are less active.

Recommendations

Explore the following table of recommendations to optimize your AKS configuration for cost.

Recommendation Benefit
Cluster and workload architectures: Align SKU selection and managed disk size with workload requirements. Matching your selection to your workload demands ensures you don't pay for unneeded resources.
Cluster and workload architectures: Use the Start and Stop feature in Azure Kubernetes Services (AKS). The AKS Stop and Start cluster feature allows AKS customers to pause an AKS cluster, saving time and cost. The stop and start feature keeps cluster configurations in place and customers can pick up where they left off without reconfiguring the clusters.
Cluster architecture: Enable cluster autoscaler to automatically reduce the number of agent nodes in response to excess resource capacity. Automatically scaling down the number of nodes in your AKS cluster lets you run an efficient cluster when demand is low, and scale back up when demand returns.
Workload architecture: Consider using Azure Spot VMs for workloads that can handle interruptions, early terminations, and evictions. For example, workloads such as batch processing jobs, development and testing environments, and large compute workloads may be good candidates for you to schedule on a spot node pool. Using spot VMs for nodes with your AKS cluster allows you to take advantage of unused capacity in Azure at a significant cost savings.
Cluster architecture: Enforce resource quotas at the namespace level. Resource quotas provide a way to reserve and limit resources across a development team or project. These quotas are defined on a namespace and can be used to set quotas on compute resources, storage resources, and object counts. When you define resource quotas, all pods created in the namespace must provide limits or requests in their pod specifications.
Workload architecture: Use the Horizontal pod autoscaler. Adjust the number of pods in a deployment depending on CPU utilization or other select metrics, which supports cluster scale-in operations.
Cluster architecture: Configure monitoring of cluster with Container insights. Container insights helps provide actionable insights into your clusters idle and unallocated resources.
Cluster architecture: Select the appropriate region. Due to many factors, cost of resources varies per region in Azure. Evaluate the cost, latency, and compliance requirements to ensure you are running your workload cost-effectively and it doesn't affect your end-users or create additional networking charges.
Cluster architecture: Sign up for Azure Reservations. If you've properly planned for capacity, and your workload is predictable and will exist for an extended period of time, sign up for Azure Reserved Instances to further reduce your resource costs.
Workload architecture: Maintain small and optimized images. Streamlining your images helps reduce costs, since new nodes need to download these images. Build images in a way that allows the container to start as soon as possible, to help avoid user request failures or timeouts while the application is starting up, which could otherwise lead to overprovisioning.
Cluster architecture: Use Kubernetes Resource Quotas. Resource quotas can be used to limit resource consumption for each namespace in your cluster, and by extension resource utilization for the Azure service.
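The namespace-level quota rows above translate into a ResourceQuota object. A sketch with a hypothetical `team-a` namespace and placeholder values to adapt to your capacity plan:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a             # hypothetical team namespace
spec:
  hard:
    requests.cpu: "4"           # cap on summed CPU requests in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"             # cap on summed CPU limits in the namespace
    limits.memory: 16Gi
    pods: "20"                  # object-count quota
```

Once a compute quota is in place, pods created in the namespace must declare requests and limits in their specs, or admission fails.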

For more suggestions, see Principles of the cost optimization pillar.

Policy definitions

While there are no built-in policies that are related to cost optimization, custom policies can be created for both the AKS resource and for the Azure Policy add-on for Kubernetes. This allows you to add additional cost optimization constraints you'd like to enforce in your cluster and workload architecture.

Operational excellence

Monitoring and diagnostics are crucial. Not only can you measure performance statistics, but you can also use metrics to troubleshoot and remediate issues quickly. We recommend you review the Operational excellence design principles and the Day-2 operations guide.

When discussing operational excellence with Azure Kubernetes Service, it's important to distinguish between cluster operational excellence and workload operational excellence. Cluster operations is a shared responsibility between the cluster admin and their resource provider, while workload operations is the domain of a developer. Azure Kubernetes Service has considerations and recommendations for both of these roles.

In the design checklist and list of recommendations below, call-outs are made to indicate whether each choice is applicable to cluster architecture, workload architecture, or both.

Design checklist

  • Cluster architecture: Use a template-based deployment using Bicep, Terraform, or others. Make sure that all deployments are repeatable, traceable, and stored in a source code repo.
  • Cluster architecture: Build an automated process to ensure your clusters are bootstrapped with the necessary cluster-wide configurations and deployments. This is often performed using GitOps.
  • Workload architecture: Use repeatable and automated deployment processes for your workload within your software development lifecycle.
  • Cluster architecture: Enable diagnostics settings to ensure control plane or core API server interactions are logged.
  • Cluster and workload architectures: Enable Container insights to collect metrics, logs, and diagnostics to monitor the availability and performance of the cluster and workloads running on it.
  • Workload architecture: The workload should be designed to emit telemetry that can be collected, which should also include liveness and readiness statuses.
  • Cluster and workload architectures: Use chaos engineering practices that target Kubernetes to identify application or platform reliability issues.
  • Workload architecture: Optimize your workload to operate and deploy efficiently in a container.
  • Cluster and workload architectures: Enforce cluster and workload governance using Azure Policy.

Recommendations

Explore the following table of recommendations to optimize your AKS configuration for operations.

Recommendation Benefit
Cluster and workload architectures: Review AKS best practices documentation. To build and run applications successfully in AKS, there are key considerations to understand and implement. These areas include multi-tenancy and scheduler features, cluster and pod security, and business continuity and disaster recovery.
Cluster and workload architectures: Review Azure Chaos Studio. Azure Chaos Studio can help simulate faults and trigger disaster recovery situations.
Cluster and workload architectures: Configure monitoring of cluster with Container insights. Container insights helps monitor the performance of containers by collecting memory and processor metrics from controllers, nodes, and containers that are available in Kubernetes through the Metrics API and container logs.
Workload architecture: Monitor application performance with Azure Monitor. Configure Application Insights for code-based monitoring of applications running in an AKS cluster.
Workload architecture: Configure scraping of Prometheus metrics with Container insights. Container insights, which is part of Azure Monitor, provides a seamless onboarding experience to collect Prometheus metrics. Reference Configure scraping of Prometheus metrics for more information.
Cluster architecture: Adopt a multiregion strategy by deploying AKS clusters across different Azure regions to maximize availability and provide business continuity. Internet-facing workloads should leverage Azure Front Door or Azure Traffic Manager to route traffic globally across AKS clusters.
Cluster architecture: Operationalize clusters and pods configuration standards with Azure Policy. Azure Policy can help to apply at-scale enforcement and safeguards on your clusters in a centralized, consistent manner. It can also control what functions pods are granted and if anything is running against company policy.
Workload architecture: Use platform capabilities in your release engineering process. Kubernetes and ingress controllers support many advanced deployment patterns for inclusion in your release engineering process. Consider patterns like blue-green deployments or canary releases.
Cluster and workload architectures: For mission-critical workloads, use stamp-level blue/green deployments. Automate your mission-critical design areas, including deployment and testing.
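One minimal way to express a blue-green cutover on Kubernetes is to run two Deployments distinguished by a slot label and flip the Service selector between them. A sketch (the app name, slot labels, and ports are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp                   # hypothetical service name
spec:
  selector:
    app: myapp
    slot: blue                  # change to "green" to cut traffic over
  ports:
    - port: 80
      targetPort: 8080
```

Both the blue and green Deployments carry `app: myapp` plus their `slot` label; editing the selector switches all traffic at once, and switching it back is the rollback. Ingress controllers and service meshes offer finer-grained weighted variants of the same idea.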

For more suggestions, see Principles of the operational excellence pillar.

Azure Advisor also makes recommendations on a subset of the items listed in the policy section below, such as unsupported AKS versions and unconfigured diagnostic settings. Likewise, it makes workload recommendations around the use of the default namespace.

Policy definitions

Azure Policy offers a variety of built-in policy definitions that apply both to the Azure resource, like standard policy definitions, and, through the Azure Policy add-on for Kubernetes, within the cluster. Many of the Azure resource policies come in both Audit/Deny and Deploy If Not Exists variants.

There are numerous policies, and key policies related to this pillar are summarized here. For a more detailed view, see built-in policy definitions for Kubernetes.

Cluster architecture

  • Azure Policy add-on for Kubernetes
  • GitOps configuration policies
  • Diagnostics settings policies
  • AKS version restrictions
  • Prevent command invoke

Cluster and workload architecture

  • Namespace deployment restrictions

In addition to the built-in policies, custom policies can be created for both the AKS resource and for the Azure Policy add-on for Kubernetes. This allows you to add additional operational constraints you'd like to enforce in your cluster and workload architecture.

Performance efficiency

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an efficient manner. We recommend you review the Performance efficiency principles.

When discussing performance with Azure Kubernetes Service, it's important to distinguish between cluster performance and workload performance. Cluster performance is a shared responsibility between the cluster admin and their resource provider, while workload performance is the domain of a developer. Azure Kubernetes Service has considerations and recommendations for both of these roles.

In the design checklist and list of recommendations below, call-outs are made to indicate whether each choice is applicable to cluster architecture, workload architecture, or both.

Design checklist

As you make design choices for Azure Kubernetes Service, review the Performance efficiency principles.

  • Cluster and workload architectures: Perform and iterate on a detailed capacity plan exercise that includes SKU, autoscale settings, IP addressing, and failover considerations.
  • Cluster architecture: Enable cluster autoscaler to automatically adjust the number of agent nodes in response to workload demands.
  • Cluster architecture: Use the Horizontal pod autoscaler to adjust the number of pods in a deployment depending on CPU utilization or other select metrics.
  • Cluster and workload architectures: Perform ongoing load testing activities that exercise both the pod and cluster autoscaler.
  • Cluster and workload architectures: Separate workloads into different node pools, allowing independent scaling.

Recommendations

Explore the following table of recommendations to optimize your Azure Kubernetes Service configuration for performance.

Recommendation Benefit
Cluster and workload architectures: Develop a detailed capacity plan and continually review and revise it. After you formalize your capacity plan, update it frequently by continuously observing the resource utilization of the cluster.
Cluster architecture: Enable cluster autoscaler to automatically adjust the number of agent nodes in response to resource constraints. The ability to automatically scale up or down the number of nodes in your AKS cluster lets you run an efficient, cost-effective cluster.
Cluster and workload architectures: Separate workloads into different node pools and consider scaling user node pools. Unlike System node pools that always require running nodes, user node pools allow you to scale up or down.
Workload architecture: Use AKS advanced scheduler features. Helps control balancing of resources for workloads that require them.
Workload architecture: Use meaningful workload scaling metrics. Not all scale decisions can be derived from CPU or memory metrics. Often, scale considerations come from more complex or even external data points. Use KEDA to build a meaningful autoscale ruleset based on signals that are specific to your workload.
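The Horizontal Pod Autoscaler rows above can be written against the `autoscaling/v2` API. A sketch targeting a hypothetical `myapp` Deployment on CPU utilization (KEDA layers external, event-driven signals onto this same mechanism):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa               # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp                 # hypothetical workload
  minReplicas: 2
  maxReplicas: 10               # placeholder bounds from your capacity plan
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above ~70% average CPU request
```

Note that utilization is measured against the pods' CPU requests, which is another reason to define requests accurately in the workload manifest.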

For more suggestions, see Principles of the performance efficiency pillar.

Policy definitions

Azure Policy offers a variety of built-in policy definitions that apply both to the Azure resource, like standard policy definitions, and, through the Azure Policy add-on for Kubernetes, within the cluster. Many of the Azure resource policies come in both Audit/Deny and Deploy If Not Exists variants.

There are numerous policies, and key policies related to this pillar are summarized here. For a more detailed view, see built-in policy definitions for Kubernetes.

Cluster and workload architecture

  • CPU and memory resource limits

In addition to the built-in policies, custom policies can be created for both the AKS resource and for the Azure Policy add-on for Kubernetes. This allows you to add additional performance constraints you'd like to enforce in your cluster and workload architecture.

Additional resources

Azure Architecture Center guidance

Cloud Adoption Framework guidance

Next steps