Configure Kubernetes cluster for Azure Machine Learning

APPLIES TO: Python SDK azure-ai-ml v2 (preview)

APPLIES TO: Azure CLI ml extension v2 (current)

Azure Machine Learning Kubernetes compute enables you to run training jobs such as AutoML, pipeline, and distributed jobs, or to deploy models as online or batch endpoints. Azure ML Kubernetes compute supports two kinds of Kubernetes cluster:

  • Azure Kubernetes Service (AKS) cluster in Azure. With your own managed AKS cluster in Azure, you gain the security and controls to meet compliance requirements, along with the flexibility to manage your teams' ML workloads.
  • Azure Arc-enabled Kubernetes (Arc Kubernetes) cluster. With an Arc Kubernetes cluster, you can train or deploy models on any infrastructure on-premises, across multicloud environments, or at the edge.

In this article, you learn how to configure an existing Kubernetes cluster for Azure Machine Learning.

Prerequisites

  • An AKS cluster up and running in Azure,
  • or an Arc Kubernetes cluster up and running. Follow the instructions in connect existing Kubernetes cluster to Azure Arc.
    • If this is an Azure Red Hat OpenShift (ARO) cluster or an OpenShift Container Platform (OCP) cluster, make sure to satisfy the additional prerequisite step here.
  • The Kubernetes cluster must have a minimum of 4 vCPU cores and 8 GB of memory.
  • A cluster running behind an outbound proxy server or firewall needs additional network configuration.
  • Install or upgrade Azure CLI to version 2.24.0 or higher.
  • Install or upgrade the Azure CLI extension k8s-extension to version 1.2.3 or higher (see the commands after this list).
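The following commands are one way to satisfy the CLI prerequisites (a minimal sketch; check the Azure CLI documentation for current install and upgrade guidance):

    # Verify the Azure CLI version (2.24.0 or higher is required)
    az version

    # Install the k8s-extension extension, or update it if it's already installed
    az extension add --name k8s-extension
    az extension update --name k8s-extension

    # Verify the k8s-extension version (1.2.3 or higher is required)
    az extension show --name k8s-extension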

Limitations

Deploy AzureML extension to Kubernetes cluster

The AzureML extension consists of a set of system components deployed to your Kubernetes cluster in the azureml namespace, enabling your cluster to run Azure ML workloads - model training jobs or model endpoints. You can use the Azure CLI command k8s-extension create to deploy the AzureML extension. The generally available (GA) version of the AzureML extension is 1.1.1 or higher.

For a detailed list of AzureML extension system components, see AzureML extension components.

To deploy the AzureML extension with the CLI, use the az k8s-extension create command, passing in values for the mandatory parameters.

Four typical extension deployment scenarios are listed below for reference. To deploy the extension for production usage, carefully read the complete list of configuration settings.

  • Use an AKS cluster in Azure for a quick proof of concept, to run training jobs or to deploy models as online/batch endpoints

    For AzureML extension deployment on AKS cluster, make sure to specify managedClusters value for --cluster-type parameter. Run the following Azure CLI command to deploy AzureML extension:

    az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True enableInference=True inferenceRouterServiceType=LoadBalancer allowInsecureConnections=True inferenceLoadBalancerHA=False --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope cluster
    
  • Use an Arc Kubernetes cluster outside of Azure for a quick proof of concept, to run training jobs only

    For AzureML extension deployment on Arc Kubernetes cluster, you would need to specify connectedClusters value for --cluster-type parameter. Run the following Azure CLI command to deploy AzureML extension:

    az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <your-RG-name> --scope cluster
    
  • Enable an AKS cluster in Azure for production training and inference workloads

    For AzureML extension deployment on AKS, make sure to specify managedClusters value for --cluster-type parameter. This example assumes your cluster has more than 3 nodes and that you will use an Azure public load balancer and HTTPS for inference workload support. Run the following Azure CLI command to deploy AzureML extension:

    az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True enableInference=True inferenceRouterServiceType=LoadBalancer sslCname=<ssl cname> --config-protected sslCertPemFile=<file-path-to-cert-PEM> sslKeyPemFile=<file-path-to-cert-KEY> --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope cluster
    
  • Enable an Arc Kubernetes cluster anywhere for production training and inference workload using NVIDIA GPUs

    For AzureML extension deployment on Arc Kubernetes cluster, make sure to specify connectedClusters value for --cluster-type parameter. This example assumes your cluster has more than 3 nodes and that you will use a NodePort service type and HTTPS for inference workload support. Run the following Azure CLI command to deploy AzureML extension:

    az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True enableInference=True inferenceRouterServiceType=NodePort sslCname=<ssl cname> installNvidiaDevicePlugin=True installDcgmExporter=True --config-protected sslCertPemFile=<file-path-to-cert-PEM> sslKeyPemFile=<file-path-to-cert-KEY> --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <your-RG-name> --scope cluster
    

Key considerations for AzureML extension deployment

AzureML extension deployment allows you to specify the configuration settings needed to support different workloads. Before deploying the AzureML extension, read the following carefully to avoid unnecessary extension deployment errors:

  • The type of workload to enable for your cluster. The enableTraining and enableInference config settings are the relevant choices here: enableTraining enables training and batch scoring workloads; enableInference enables real-time inference workloads.
  • Real-time inference support requires the azureml-fe router service to be deployed for routing incoming inference requests to model pods, and you need to specify the inferenceRouterServiceType config setting for azureml-fe. azureml-fe can be deployed with one of the following inferenceRouterServiceType values (a quick way to verify the deployed service type follows this list):
    • Type LoadBalancer. Exposes azureml-fe externally using a cloud provider's load balancer. To specify this value, ensure that your cluster supports load balancer provisioning. Note that most on-premises Kubernetes clusters might not support an external load balancer.
    • Type NodePort. Exposes azureml-fe on each node's IP at a static port. You can reach azureml-fe from outside the cluster by requesting <NodeIP>:<NodePort>. Using NodePort also allows you to set up your own load balancing solution and SSL termination for azureml-fe.
    • Type ClusterIP. Exposes azureml-fe on a cluster-internal IP, making azureml-fe reachable only from within the cluster. For azureml-fe to serve inference requests coming from outside the cluster, you must set up your own load balancing solution and SSL termination for azureml-fe.
  • For real-time inference support, to ensure high availability of the azureml-fe routing service, AzureML extension deployment by default creates 3 replicas of azureml-fe for clusters with 3 or more nodes. If your cluster has fewer than 3 nodes, set inferenceLoadBalancerHA=False.
  • For real-time inference support, also consider using HTTPS to restrict access to model endpoints and to secure the data that clients submit. For this purpose, you need to specify either the sslSecret config setting or the combination of the sslKeyPemFile and sslCertPemFile config settings. By default, AzureML extension deployment requires HTTPS support and expects one of the above config settings. For development or testing purposes, HTTP support is conveniently provided through the config setting allowInsecureConnections=True.
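After the extension is deployed with inference enabled, you can confirm how azureml-fe is exposed. A minimal sketch, assuming the default service name azureml-fe in the azureml namespace as described above:

    # Show the azureml-fe router service, its type (LoadBalancer, NodePort, or ClusterIP), and its IPs/ports
    kubectl get svc azureml-fe -n azureml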

For a complete list of configuration settings available at AzureML extension deployment time, see Review AzureML extension configuration settings.

Verify AzureML extension deployment

  1. Run the following CLI command to check AzureML extension details:

    az k8s-extension show --name <extension-name> --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group>
    
  2. In the response, look for "name" and "provisioningState": "Succeeded". Note it might show "provisioningState": "Pending" for the first few minutes.

  3. If the provisioningState shows Succeeded, run the following command on your machine, with the kubeconfig file pointed to your cluster, to check that all pods under the azureml namespace are in the Running state:

     kubectl get pods -n azureml
    

Manage AzureML extension

Update, list, show, and delete an AzureML extension, as shown in the sketch below.
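The following az k8s-extension commands cover these operations; this sketch assumes an Arc-connected cluster (for AKS, swap in --cluster-type managedClusters and your AKS cluster name):

    # Update settings of the deployed extension
    az k8s-extension update --name <extension-name> --config <key>=<value> --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group>

    # List all extensions installed on the cluster
    az k8s-extension list --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group>

    # Show details of the AzureML extension
    az k8s-extension show --name <extension-name> --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group>

    # Delete the AzureML extension
    az k8s-extension delete --name <extension-name> --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group>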


Review AzureML extension configuration settings

For AzureML extension deployment configurations, use --config or --config-protected to specify a list of key=value pairs. The following is the list of configuration settings available for different AzureML extension deployment scenarios.

| Configuration Setting Key Name | Description | Training | Inference | Training and Inference |
| --- | --- | --- | --- | --- |
| enableTraining | True or False, default False. Must be set to True for AzureML extension deployment with Machine Learning model training and batch scoring support. | Required | N/A | Required |
| enableInference | True or False, default False. Must be set to True for AzureML extension deployment with Machine Learning inference support. | N/A | Required | Required |
| allowInsecureConnections | True or False, default False. Can be set to True to use inference HTTP endpoints for development or test purposes. | N/A | Optional | Optional |
| inferenceRouterServiceType | loadBalancer, nodePort or clusterIP. Required if enableInference=True. | N/A | Required | Required |
| internalLoadBalancerProvider | This config is currently only applicable for Azure Kubernetes Service (AKS) clusters. Set to azure to allow the inference router to use an internal load balancer. | N/A | Optional | Optional |
| sslSecret | The name of a Kubernetes secret in the azureml namespace storing cert.pem (PEM-encoded SSL cert) and key.pem (PEM-encoded SSL key); required for inference HTTPS endpoint support when allowInsecureConnections is set to False. A sample YAML definition of sslSecret is shown after this table. Use this config or the combination of the sslCertPemFile and sslKeyPemFile protected config settings. | N/A | Optional | Optional |
| sslCname | An SSL CName used by the inference HTTPS endpoint. Required if allowInsecureConnections=False. | N/A | Optional | Optional |
| inferenceRouterHA | True or False, default True. By default, the AzureML extension deploys 3 inference router replicas for high availability, which requires at least 3 worker nodes in the cluster. Set to False if your cluster has fewer than 3 worker nodes; in that case only one inference router service is deployed. | N/A | Optional | Optional |
| nodeSelector | By default, the deployed Kubernetes resources are randomly deployed to 1 or more nodes of the cluster, and daemonset resources are deployed to ALL nodes. To restrict the extension deployment to specific nodes with labels key1=value1 and key2=value2, use nodeSelector.key1=value1 and nodeSelector.key2=value2 correspondingly. | Optional | Optional | Optional |
| installNvidiaDevicePlugin | True or False, default False. The NVIDIA Device Plugin is required for ML workloads on NVIDIA GPU hardware. By default, AzureML extension deployment does not install the NVIDIA Device Plugin, regardless of whether the Kubernetes cluster has GPU hardware. You can set this to True to install it, but make sure to fulfill the prerequisites. | Optional | Optional | Optional |
| installPromOp | True or False, default True. The AzureML extension needs the Prometheus operator to manage Prometheus. Set to False to reuse an existing Prometheus operator. Compatible kube-prometheus-stack helm chart versions are 9.3.4 to 30.0.1. | Optional | Optional | Optional |
| installVolcano | True or False, default True. The AzureML extension needs the Volcano scheduler to schedule jobs. Set to False to reuse an existing Volcano scheduler. Supported Volcano scheduler versions are 1.4 and 1.5. | Optional | N/A | Optional |
| installDcgmExporter | True or False, default False. dcgm-exporter can expose GPU metrics for AzureML workloads, which can be monitored in the Azure portal. Set installDcgmExporter to True to install it. To use your own dcgm-exporter, refer to DCGM exporter. | Optional | Optional | Optional |

| Configuration Protected Setting Key Name | Description | Training | Inference | Training and Inference |
| --- | --- | --- | --- | --- |
| sslCertPemFile, sslKeyPemFile | Paths to the SSL certificate and key files (PEM-encoded); required for AzureML extension deployment with inference HTTPS endpoint support when allowInsecureConnections is set to False. Note that PEM files protected with a passphrase are not supported. | N/A | Optional | Optional |
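As referenced in the sslSecret row above, a minimal sketch of such a secret follows; the secret name is a placeholder, and the data values must be the base64-encoded contents of the PEM files:

apiVersion: v1
kind: Secret
metadata:
  name: <your-ssl-secret-name>
  namespace: azureml
type: Opaque
data:
  cert.pem: <base64-encoded PEM certificate>
  key.pem: <base64-encoded PEM key>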

Attach a Kubernetes cluster to an Azure ML workspace

Once the AzureML extension is deployed on an AKS or Arc Kubernetes cluster, you can attach the Kubernetes cluster to an Azure ML workspace and create compute targets for ML professionals to use. Each attach operation creates a compute target in an Azure ML workspace; multiple attach operations on the same cluster can create multiple compute targets in a single Azure ML workspace or across multiple Azure ML workspaces.

Prerequisite

The Azure Machine Learning workspace defaults to a system-assigned managed identity to access Azure ML resources. No further steps are needed if the system-assigned default setting is on.

Otherwise, if a user-assigned managed identity is specified at Azure Machine Learning workspace creation, the following role assignments need to be granted to that managed identity manually before attaching the compute.

| Azure resource name | Role to be assigned | Description |
| --- | --- | --- |
| Azure Relay | Azure Relay Owner | Only applicable for Arc-enabled Kubernetes clusters. Azure Relay isn't created for AKS clusters without Arc connected. |
| Azure Arc-enabled Kubernetes or AKS | Reader | Applicable for both Arc-enabled Kubernetes clusters and AKS clusters. |

Azure Relay resource is created during the extension deployment under the same Resource Group as the Arc-enabled Kubernetes cluster.
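A sketch of granting these role assignments with the Azure CLI; the principal ID and resource IDs are placeholders you need to look up for your environment:

# Grant Azure Relay Owner on the Azure Relay resource (Arc-enabled clusters only)
az role assignment create --assignee <identity-principal-id> --role "Azure Relay Owner" --scope <azure-relay-resource-id>

# Grant Reader on the Arc-enabled Kubernetes or AKS cluster resource
az role assignment create --assignee <identity-principal-id> --role "Reader" --scope <cluster-resource-id>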

APPLIES TO: Azure CLI ml extension v2 (current)

The following commands show how to attach an AKS and Azure Arc-enabled Kubernetes cluster, and use it as a compute target with managed identity enabled.

AKS cluster

az ml compute attach --resource-group <resource-group-name> --workspace-name <workspace-name> --type Kubernetes --name k8s-compute --resource-id "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.ContainerService/managedclusters/<cluster-name>" --identity-type SystemAssigned --namespace <Kubernetes namespace to run AzureML workloads> --no-wait

Arc Kubernetes cluster

az ml compute attach --resource-group <resource-group-name> --workspace-name <workspace-name> --type Kubernetes --name amlarc-compute --resource-id "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.Kubernetes/connectedClusters/<cluster-name>" --user-assigned-identities "subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name>" --no-wait

Set the --type argument to Kubernetes. Use the --identity-type argument to enable SystemAssigned or UserAssigned managed identities.

Important

--user-assigned-identities is only required for UserAssigned managed identities. Although you can provide a list of comma-separated user managed identities, only the first one is used when you attach your cluster.

Compute attach won't create the Kubernetes namespace automatically or validate whether the Kubernetes namespace exists. You need to verify that the specified namespace exists in your cluster; otherwise, any AzureML workloads submitted to this compute will fail.
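A quick check with kubectl, using the namespace value you pass to --namespace at attach time:

# Verify the namespace exists; create it first if the command reports NotFound
kubectl get namespace <namespace-name>
kubectl create namespace <namespace-name>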

Create and use instance types for efficient compute resource utilization

What are instance types?

Instance types are an Azure Machine Learning concept that allows targeting certain types of compute nodes for training and inference workloads. For an Azure VM, an example of an instance type is STANDARD_D2_V3.

In Kubernetes clusters, instance types are represented in a custom resource definition (CRD) that is installed with the AzureML extension. An instance type consists of two elements: nodeSelector and resources. In short, a nodeSelector specifies which node a pod should run on; the node must have a corresponding label. In the resources section, we can set the compute resources (CPU, memory and NVIDIA GPU) for the pod.
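Because nodeSelector matches on node labels, target nodes must carry the corresponding label beforehand. A sketch with kubectl; the label key and value here are hypothetical examples matching the custom instance type shown later in this article:

# Label a node so that instance types using nodeSelector mylabel=mylabelvalue can target it
kubectl label node <node-name> mylabel=mylabelvalue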

Default instance type

By default, a defaultinstancetype with the following definition is created when you attach a Kubernetes cluster to an AzureML workspace:

  • No nodeSelector is applied, meaning the pod can get scheduled on any node.
  • The workload's pods are assigned default resources with 0.6 cpu cores, 1536Mi memory and 0 GPU:
resources:
  requests:
    cpu: "0.6"
    memory: "1536Mi"
  limits:
    cpu: "0.6"
    memory: "1536Mi"
    nvidia.com/gpu: null

Note

  • The default instance type purposefully uses little resources. To ensure all ML workloads run with appropriate resources, for example GPU resource, it is highly recommended to create custom instance types.
  • defaultinstancetype will not appear as an InstanceType custom resource in the cluster when running the command kubectl get instancetype, but it will appear in all clients (UI, CLI, SDK).
  • defaultinstancetype can be overridden with a custom instance type definition having the same name as defaultinstancetype (see the Create custom instance types section).

Create custom instance types

To create a new instance type, create a new custom resource for the instance type CRD. For example:

kubectl apply -f my_instance_type.yaml

With my_instance_type.yaml:

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: myinstancetypename
spec:
  nodeSelector:
    mylabel: mylabelvalue
  resources:
    limits:
      cpu: "1"
      nvidia.com/gpu: 1
      memory: "2Gi"
    requests:
      cpu: "700m"
      memory: "1500Mi"

The example above creates an instance type with the following behavior:

  • Pods will be scheduled only on nodes with label mylabel: mylabelvalue.
  • Pods will be assigned resource requests of 700m CPU and 1500Mi memory.
  • Pods will be assigned resource limits of 1 CPU, 2Gi memory and 1 NVIDIA GPU.

Note

  • NVIDIA GPU resources are only specified in the limits section as integer values. For more information, see the Kubernetes documentation.
  • CPU and memory resources are string values.
  • CPU can be specified in millicores, for example 100m, or in full numbers, for example "1" is equivalent to 1000m.
  • Memory can be specified as a full number + suffix, for example 1024Mi for 1024 MiB.

It's also possible to create multiple instance types at once:

kubectl apply -f my_instance_type_list.yaml

With my_instance_type_list.yaml:

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceTypeList
items:
  - metadata:
      name: cpusmall
    spec:
      resources:
        requests:
          cpu: "100m"
          memory: "100Mi"
        limits:
          cpu: "1"
          nvidia.com/gpu: 0
          memory: "1Gi"

  - metadata:
      name: defaultinstancetype
    spec:
      resources:
        requests:
          cpu: "1"
          memory: "1Gi" 
        limits:
          cpu: "1"
          nvidia.com/gpu: 0
          memory: "1Gi"

The above example creates two instance types: cpusmall and defaultinstancetype. This defaultinstancetype definition overrides the defaultinstancetype definition created when the Kubernetes cluster was attached to the AzureML workspace.

If a training or inference workload is submitted without an instance type, it uses the defaultinstancetype. To specify a default instance type for a Kubernetes cluster, create an instance type with name defaultinstancetype. It will automatically be recognized as the default.

Select instance type to submit training job

To select an instance type for a training job using CLI (V2), specify its name as part of the resources properties section in job YAML. For example:

command: python -c "print('Hello world!')"
environment:
  image: library/python:latest
compute: azureml:<compute_target_name>
resources:
  instance_type: <instance_type_name>

In the above example, replace <compute_target_name> with the name of your Kubernetes compute target and <instance_type_name> with the name of the instance type you wish to select. If no instance_type property is specified, the system uses defaultinstancetype to submit the job.
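To submit a job with this YAML, the standard CLI (v2) job command applies; job.yaml here is a hypothetical file name for the spec above:

az ml job create -f job.yaml --resource-group <resource-group-name> --workspace-name <workspace-name>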

Select instance type to deploy model

To select an instance type for a model deployment using CLI (V2), specify its name for instance_type property in deployment YAML. For example:

name: blue
app_insights_enabled: true
endpoint_name: <endpoint name>
model: 
  path: ./model/sklearn_mnist_model.pkl
code_configuration:
  code: ./script/
  scoring_script: score.py
instance_type: <instance type name>
environment: 
  conda_file: file:./model/conda.yml
  image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210727.v1

In the above example, replace <instance_type_name> with the name of the instance type you wish to select. If no instance_type property is specified, the system uses defaultinstancetype to deploy the model.
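Creating the deployment from this YAML follows the usual CLI (v2) pattern; deployment.yaml here is a hypothetical file name for the spec above:

az ml online-deployment create -f deployment.yaml --resource-group <resource-group-name> --workspace-name <workspace-name>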

Recommended best practices

Separation of responsibilities between the IT-operations team and the data-science team. Managing your own Kubernetes compute and infrastructure for ML workloads requires Kubernetes admin privileges and expertise, and is best done by the IT-operations team so that the data-science team can focus on ML models for organizational efficiency.

Create and manage instance types for different ML workload scenarios. Each ML workload uses different amounts of compute resources, such as CPU/GPU and memory. Azure ML implements instance types as a Kubernetes custom resource definition (CRD) with nodeSelector and resource request/limit properties. With a carefully curated list of instance types, IT operations can target ML workloads at specific nodes and manage compute resource utilization efficiently.

Multiple Azure ML workspaces can share the same Kubernetes cluster. You can attach a Kubernetes cluster multiple times to the same Azure ML workspace or to different Azure ML workspaces, creating multiple compute targets in one workspace or across multiple workspaces. Since many customers organize data science projects around Azure ML workspaces, multiple data science projects can now share the same Kubernetes cluster. This significantly reduces ML infrastructure management overhead and saves IT cost.

Team/project workload isolation using Kubernetes namespaces. When you attach a Kubernetes cluster to an Azure ML workspace, you can specify a Kubernetes namespace for the compute target; all workloads run by the compute target are placed under the specified namespace.

Next steps

Additional resources

Examples

All AzureML examples can be found in https://github.com/Azure/azureml-examples.git.

For any AzureML example, you only need to update the compute target name to your Kubernetes compute target; no other changes are required.