Operations management considerations for Azure Kubernetes Service
Kubernetes is a relatively new technology, rapidly evolving with an impressive ecosystem. As such, it can be challenging to manage and protect it.
Operations baseline for AKS
By properly designing your Azure Kubernetes Service (AKS) solution with management and monitoring in mind, you can work toward operational excellence and customer success.
Design considerations
Consider the following factors:
- Be aware of AKS limits. Use multiple AKS instances to scale beyond those limits.
- Be aware of ways to isolate workloads logically within a cluster and physically in separate clusters.
- Be aware of ways to control resource consumption by workloads.
- Be aware of ways to help Kubernetes understand the health of your workloads.
- Be aware of the available virtual machine (VM) sizes and the impact of choosing one over another. Larger VMs can handle more load, and they're more efficient because AKS reserves a smaller percentage of their resources, which leaves more available for your workloads. Smaller VMs are easier to replace when a node is unavailable because of planned or unplanned maintenance. Also be aware of node pools, which are groups of VMs in a scale set and which let you run VMs of different sizes in the same cluster.
- Be aware of ways to monitor and log AKS. Kubernetes consists of various components, and monitoring and logging should provide insight into their health, trends, and potential issues.
- Kubernetes and the applications running on it can generate many events. Alerts help you differentiate between log entries kept for historical purposes and those that require immediate action.
- Be aware of the updates and upgrades that you need to apply. At the Kubernetes level, there are major, minor, and patch versions. You should apply these updates to remain in a supported state according to the upstream Kubernetes policy. At the worker host level, OS kernel patches might require a reboot, which you're responsible for, as are upgrades to new OS versions. In addition to manually upgrading a cluster, you can set an auto-upgrade channel on your cluster.
- Be aware of resource limitations of the cluster as well as individual workloads.
- Be aware of the differences between the horizontal pod autoscaler and the cluster autoscaler.
- Consider securing traffic between pods using network policies and the Azure policies plug-in.
- To help troubleshoot your application and services running on AKS, you may need to view the logs generated by control plane components. You may want to enable resource logs for AKS since logging is not enabled by default.
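The difference between the two autoscalers mentioned in the list above can be sketched in a few lines of Python. The horizontal pod autoscaler changes the number of pod replicas based on a metric ratio, following the formula in the upstream Kubernetes documentation. The cluster autoscaler changes the number of nodes when pods can't be scheduled. The function names, the 10% tolerance value, and the CPU-only node-fit calculation are illustrative assumptions, not AKS APIs.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count the horizontal pod autoscaler would aim for.

    Follows the upstream formula:
    desired = ceil(current_replicas * current_metric / target_metric).
    The tolerance (assumed to be 10% here) suppresses scaling on small deviations.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def nodes_needed(replicas, pod_cpu_request_m, node_allocatable_cpu_m):
    """Simplified view of the cluster autoscaler's job: enough nodes to fit
    every replica's CPU request (hypothetical CPU-only model, in millicores)."""
    pods_per_node = node_allocatable_cpu_m // pod_cpu_request_m
    if pods_per_node == 0:
        raise ValueError("pod request exceeds node allocatable CPU")
    return math.ceil(replicas / pods_per_node)


# Four replicas averaging 90% CPU against a 60% target scale out to six pods.
replicas = hpa_desired_replicas(4, current_metric=90, target_metric=60)
# Six pods requesting 500m each need two nodes with 1900m allocatable CPU.
nodes = nodes_needed(replicas, pod_cpu_request_m=500, node_allocatable_cpu_m=1900)
print(replicas, nodes)
```

The two functions work at different layers: the first reacts to workload metrics, and the second reacts to pods that no longer fit on the existing nodes. That's why the two autoscalers are typically used together.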
Recommendations
- Understand AKS limits.
- Use logical isolation at the namespace level to separate applications, teams, environments, and business units. See multitenancy and cluster isolation. Node pools can also help with nodes that need different specifications, and with maintenance, such as Kubernetes upgrades, across multiple node pools.
- Plan and apply resource quotas at the namespace level. If pods don't define resource requests and limits, reject the deployment by using policies. This doesn't apply to kube-system pods, because not all kube-system pods have requests and limits. Monitor resource usage and adjust quotas as needed. See basic scheduler features.
- Add health probes to your pods. Make sure pods contain livenessProbe, readinessProbe, and startupProbe AKS health probes.
- Use VM sizes big enough to contain multiple container instances so you get the benefits of increased density, but not so big that your cluster can't handle the workload of a failing node.
- Use a monitoring solution. Azure Monitor for containers is set up by default and provides easy access to many insights. You can use Prometheus integration if you want to drill deeper or have experience using Prometheus. If you run your own monitoring application on AKS, use Azure Monitor to monitor that application as well.
- Use an alerting system to provide notifications when something needs immediate action. See metric alerts.
- Use the automatic node pool scaling feature together with the horizontal pod autoscaler to meet application demands and to mitigate peak-hour loads.
- Use Azure Advisor to get best practice recommendations on cost, security, reliability, operational excellence and performance. Also use Microsoft Defender for Cloud to prevent and detect threats like image vulnerabilities.
- Use Azure Arc enabled Kubernetes to manage non-AKS Kubernetes clusters in Azure using Azure Policy, Defender for Cloud, GitOps, and so on.
- Use pod requests and limits to manage the compute resources within an AKS cluster. Pod requests and limits inform the Kubernetes scheduler which compute resources to assign to a pod.
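As a rough illustration of the recommendations on requests, limits, and health probes, the following sketch checks a pod manifest (loaded as a Python dictionary) for the fields named above. The field names (`resources.requests`, `resources.limits`, `livenessProbe`, `readinessProbe`, `startupProbe`) are the standard Kubernetes ones. The `lint_pod_spec` helper, the example pod, and the image name are hypothetical, and this is not how Azure Policy itself is implemented.

```python
REQUIRED_RESOURCES = ("requests", "limits")
REQUIRED_PROBES = ("livenessProbe", "readinessProbe", "startupProbe")


def lint_pod_spec(pod):
    """Return a list of findings for a pod manifest given as a dict."""
    findings = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        # Flag containers that omit (or leave empty) requests or limits.
        for key in REQUIRED_RESOURCES:
            if not resources.get(key):
                findings.append(f"{name}: missing resources.{key}")
        # Flag containers that don't define all three probe types.
        for probe in REQUIRED_PROBES:
            if probe not in container:
                findings.append(f"{name}: missing {probe}")
    return findings


example_pod = {
    "metadata": {"name": "web", "namespace": "team-a"},
    "spec": {
        "containers": [
            {
                "name": "web",
                "image": "example.azurecr.io/web:1.0",  # hypothetical image
                "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}},
                "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
                "startupProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
            }
        ]
    },
}

for finding in lint_pod_spec(example_pod):
    print(finding)
```

A check like this could run in a CI pipeline before deployment. Enforcement inside the cluster would still rely on policies and namespace resource quotas.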
Business continuity / disaster recovery to protect and recover AKS
Your organization needs to design suitable Azure Kubernetes Service (AKS) platform-level capabilities to meet its specific requirements. These application services have requirements related to recovery time objective (RTO) and recovery point objective (RPO). There are multiple considerations to address for AKS disaster recovery. Your first step is to define a service-level agreement (SLA) for your infrastructure and application. Learn about the SLA for Azure Kubernetes Service (AKS). See the SLA details section for information about monthly uptime calculations.
Design considerations
Consider the following factors:
The AKS cluster should use multiple nodes in a node pool to provide the minimum level of availability for your application.
Set pod requests and limits. Setting these limits lets Kubernetes:
- Efficiently give CPU and memory resources to the pods.
- Have higher container density on a node.
Limits can also increase reliability with reduced costs because of better use of hardware.
AKS suitability for Availability Zones or availability sets.
- Choose a region that supports Availability Zones.
- Availability Zones can only be set when the node pool is created and can't be changed later. Multizone support only applies to node pools.
- For complete zonal benefit, all service dependencies must also support zones. If a dependent service doesn't support zones, it's possible that a zone failure could cause that service to fail.
- For higher availability beyond what Availability Zones can achieve, run multiple AKS clusters in different paired regions. If an Azure resource supports geo-redundancy, provide the location where the redundant service will have its secondary region.
You should be aware of guidelines for disaster recovery in AKS. Then consider whether they apply to the AKS clusters that you use for Azure Dev Spaces.
Consistently create backups for applications and data.
- A non-stateful service can be replicated efficiently.
- If you need to store state in the cluster, back up the data frequently in the paired region. Keep in mind that storing state in the cluster properly can be complicated.
Cluster update and maintenance.
- Always keep your cluster up to date.
- Be aware of the release and deprecation process.
- Plan your updates and maintenance in advance.
Network connectivity if a failover occurs.
- Choose a traffic router that can distribute traffic across zones or regions, depending on your requirement. This architecture deploys Azure Load Balancer because it can distribute non-web traffic across zones.
- If you need to distribute traffic across regions, consider using Azure Front Door.
Planned and unplanned failovers.
- When setting up each Azure service, choose features that support disaster recovery. For example, in this architecture, enable Azure Container Registry for geo-replication. If a region goes down, you can still pull images from the replicated region.
Maintain engineering DevOps capabilities to reach service level goals.
Determine whether you need a financially backed SLA for your Kubernetes API server endpoint.
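To reason about the node count and Availability Zone considerations above, it can help to estimate how much node capacity survives if one zone fails. The sketch below is a simplified model that treats all nodes as equal in size. The `capacity_after_zone_loss` helper and the node and zone names are hypothetical.

```python
from collections import Counter


def capacity_after_zone_loss(node_zones):
    """Fraction of nodes left if the most populated zone fails.

    node_zones maps a node name to its zone label.
    """
    if not node_zones:
        raise ValueError("node pool has no nodes")
    per_zone = Counter(node_zones.values())
    worst_loss = max(per_zone.values())
    return (len(node_zones) - worst_loss) / len(node_zones)


# Three nodes spread evenly across three zones keep two thirds of capacity.
spread = {"node-a": "zone-1", "node-b": "zone-2", "node-c": "zone-3"}
# Three nodes packed into one zone lose everything when that zone fails.
packed = {"node-a": "zone-1", "node-b": "zone-1", "node-c": "zone-1"}

print(capacity_after_zone_loss(spread))
print(capacity_after_zone_loss(packed))
```

The same reasoning applies to dependent services: if a dependency runs in a single zone, the workload's effective resilience is bounded by that dependency.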
Design recommendations
The following are best practices for your design:
Use three nodes for the system node pool. For the user node pool, start with no less than two nodes. If you need higher availability, set up more nodes.
Isolate your application from the system services by placing it in a separate node pool. This way, Kubernetes services run on dedicated nodes and don't compete with other services. Use tags, labels, and taints to identify the node pool to schedule your workload.
Regular upkeep of your cluster, for example, making timely updates, is crucial for reliability. Be mindful of the support window for Kubernetes versions on AKS and plan your updates in advance. Also, monitoring the health of the pods through probes is recommended.
Whenever possible, remove service state from inside containers. Instead, use an Azure platform as a service (PaaS) that supports multiregion replication.
Ensure pod resources. It's highly recommended that deployments specify pod resource requirements. The scheduler can then appropriately schedule the pod. Reliability degrades significantly when pods can't be scheduled.
Set up multiple replicas in the deployment to handle disruptions like hardware failures. For planned events like updates and upgrades, a disruption budget can ensure the required number of pod replicas exist to handle expected application load.
Your applications might use Azure Storage for their data. Because your applications are spread across multiple AKS clusters in different regions, you need to keep the storage synced. Here are two common ways to replicate storage:
- Infrastructure-based asynchronous replication
- Application-based asynchronous replication
Estimate pod limits. Test and establish a baseline. Start with equal values for requests and limits. Then, gradually tune those values until you've established a threshold that can cause instability in the cluster. Pod limits can be specified in your deployment manifests.
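One consequence of starting with equal requests and limits is worth noting: Kubernetes assigns such pods the Guaranteed quality of service (QoS) class, and lowering requests below limits during tuning moves them to Burstable. The following sketch approximates the upstream classification rules. The `qos_class` helper is hypothetical, and it compares quantity strings literally, so it treats "500m" and "0.5" as different values.

```python
def _effective(resources, resource_name):
    """Return (request, limit) for one resource; a missing request
    defaults to the limit, as Kubernetes does."""
    limit = resources.get("limits", {}).get(resource_name)
    request = resources.get("requests", {}).get(resource_name, limit)
    return request, limit


def qos_class(containers):
    """Approximate the QoS class Kubernetes would assign to a pod."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        for resource_name in ("cpu", "memory"):
            request, limit = _effective(resources, resource_name)
            if request is not None or limit is not None:
                any_set = True
            # Guaranteed needs a limit, and a request equal to it,
            # for both CPU and memory in every container.
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


baseline = [{"resources": {"requests": {"cpu": "500m", "memory": "512Mi"},
                           "limits": {"cpu": "500m", "memory": "512Mi"}}}]
tuned = [{"resources": {"requests": {"cpu": "250m", "memory": "512Mi"},
                        "limits": {"cpu": "500m", "memory": "512Mi"}}}]

print(qos_class(baseline))
print(qos_class(tuned))
```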
The built-in features provide a solution to the complex task of handling failures and disruptions in service architecture. These configurations help to simplify both design and deployment automation. When an organization has defined a standard for the SLA, RTO, and RPO, it can use services built into Kubernetes and Azure to achieve its business goals.
Set pod disruption budgets. This setting limits how many replicas in a deployment can be taken down at the same time during an update or upgrade event.
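The arithmetic behind a pod disruption budget can be sketched as follows. A budget specifies either minAvailable or maxUnavailable, and Kubernetes permits a voluntary eviction only while the number of healthy pods stays at or above the resulting threshold. The `disruptions_allowed` helper is hypothetical and handles integer values only, not percentages.

```python
def disruptions_allowed(healthy, expected, min_available=None, max_unavailable=None):
    """Number of pods that may be evicted voluntarily right now.

    healthy:  pods currently ready
    expected: replicas the workload is supposed to run
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    return max(0, healthy - desired_healthy)


# Five healthy replicas with minAvailable=4: one pod can be drained at a time.
print(disruptions_allowed(healthy=5, expected=5, min_available=4))
# One replica is already down with maxUnavailable=1: further evictions are blocked.
print(disruptions_allowed(healthy=4, expected=5, max_unavailable=1))
```

During a node upgrade, a drain that would push the count below the threshold waits until replacement pods become healthy.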
Enforce resource quotas on the service namespaces. The resource quota on a namespace will ensure pod requests and limits are properly set on a deployment.
- Setting resource quotas at the cluster level can cause problems when deploying partner services that don't have proper requests and limits.
Store your container images in Azure Container Registry and geo-replicate the registry to each AKS region.
Use the Uptime SLA to enable a financially backed, higher SLA for all clusters hosting production workloads. Uptime SLA guarantees 99.95% availability of the Kubernetes API server endpoint for clusters that use Availability Zones and 99.9% availability for clusters that don't use Availability Zones. Your nodes, node pools, and other resources are covered under their own SLA. AKS also offers a free tier with a service level objective (SLO) of 99.5% for its control plane components. Clusters without the Uptime SLA enabled should not be used for production workloads.
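To put these percentages in perspective, the sketch below converts each availability figure into the API server downtime it would permit in a month. It assumes a 30-day month for simplicity. The SLA document defines the actual monthly uptime calculation, and the `allowed_downtime_minutes` helper is hypothetical.

```python
def allowed_downtime_minutes(availability_percent, days_in_month=30):
    """Minutes of downtime per month permitted by an availability target."""
    total_minutes = days_in_month * 24 * 60
    return round(total_minutes * (100 - availability_percent) / 100, 1)


targets = [
    ("Uptime SLA with Availability Zones", 99.95),
    ("Uptime SLA without Availability Zones", 99.9),
    ("Free tier SLO", 99.5),
]

for label, percent in targets:
    print(f"{label}: {allowed_downtime_minutes(percent)} minutes per month")
```

Under this assumption, the three targets correspond to roughly 21.6, 43.2, and 216 minutes per month.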
Use multiple regions and peering locations for Azure ExpressRoute connectivity.
- If an outage affecting an Azure region or peering provider location occurs, a redundant hybrid network architecture can help ensure uninterrupted cross-premises connectivity.
Interconnect regions with global virtual network peering. If the clusters need to talk to each other, connect their virtual networks through virtual network peering. This technology interconnects virtual networks and provides high bandwidth across Microsoft's backbone network, even across different geographic regions.
Using split TCP-based anycast protocol, Azure Front Door ensures that your end users promptly connect to the nearest Front Door point of presence. Other features of Azure Front Door include:
- TLS termination
- Custom domain
- Web Application Firewall
- URL rewrite
- Session affinity
Review the needs of your application traffic to learn which solution is the most suitable.