Configure Kubernetes clusters for machine learning (preview)

Learn how to configure Azure Kubernetes Service (AKS) and Azure Arc-enabled Kubernetes clusters for training and inferencing machine learning workloads.

What is Azure Arc-enabled machine learning?

Azure Arc enables you to run Azure services in any Kubernetes environment, whether it’s on-premises, multicloud, or at the edge.

Azure Arc-enabled machine learning lets you configure and use Azure Kubernetes Service or Azure Arc-enabled Kubernetes clusters to train, inference, and manage machine learning models in Azure Machine Learning.

Machine Learning on Azure Kubernetes Service

To use Azure Kubernetes Service clusters for Azure Machine Learning training and inference workloads, you don't have to connect them to Azure Arc.

Before deploying the Azure Machine Learning extension on Azure Kubernetes Service clusters, complete the prerequisites described in the next section. Then follow the steps in the Deploy Azure Machine Learning extension section to deploy the extension.

Prerequisites

Azure Kubernetes Service (AKS)

For AKS clusters, connecting them to Azure Arc is optional.

However, you have to register the AKS-ExtensionManager feature on your subscription. Use the following command to register the feature:

az feature register --namespace Microsoft.ContainerService -n AKS-ExtensionManager
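Feature registration can take several minutes to propagate. As a sketch (assuming an authenticated Azure CLI session), you can confirm the registration state and then refresh the resource provider so the feature takes effect:

```shell
# Check the registration state; wait until it reports "Registered".
az feature show --namespace Microsoft.ContainerService --name AKS-ExtensionManager \
  --query properties.state --output tsv

# Once registered, refresh the Microsoft.ContainerService resource provider
# so the new feature takes effect on the subscription.
az provider register --namespace Microsoft.ContainerService
```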

Azure Red Hat OpenShift (ARO) and OpenShift Container Platform (OCP) only

  • An ARO or OCP Kubernetes cluster is up and running. For more information, see Create ARO Kubernetes cluster and Create OCP Kubernetes cluster.

  • Grant privileged access to AzureML service accounts.

    Run oc edit scc privileged and add the following service accounts under the users section:

    • system:serviceaccount:azure-arc:azure-arc-kube-aad-proxy-sa
    • system:serviceaccount:azureml:{EXTENSION NAME}-kube-state-metrics (Note: {EXTENSION NAME} must match the extension name used in the az k8s-extension create --name step)
    • system:serviceaccount:azureml:cluster-status-reporter
    • system:serviceaccount:azureml:prom-admission
    • system:serviceaccount:azureml:default
    • system:serviceaccount:azureml:prom-operator
    • system:serviceaccount:azureml:csi-blob-node-sa
    • system:serviceaccount:azureml:csi-blob-controller-sa
    • system:serviceaccount:azureml:load-amlarc-selinux-policy-sa
    • system:serviceaccount:azureml:azureml-fe
    • system:serviceaccount:azureml:prom-prometheus
    • system:serviceaccount:{KUBERNETES-COMPUTE-NAMESPACE}:default
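The service-account entries above must use your actual extension name and compute namespace. A minimal sketch that generates the list ready to paste under the users section when you run oc edit scc privileged (EXTENSION_NAME and COMPUTE_NAMESPACE are placeholder values, not fixed names):

```shell
# Placeholder values: substitute your own extension name and compute namespace.
EXTENSION_NAME="arcml-extension"
COMPUTE_NAMESPACE="default"

# Service accounts that need privileged SCC access, per the list above.
accounts=(
  "system:serviceaccount:azure-arc:azure-arc-kube-aad-proxy-sa"
  "system:serviceaccount:azureml:${EXTENSION_NAME}-kube-state-metrics"
  "system:serviceaccount:azureml:cluster-status-reporter"
  "system:serviceaccount:azureml:prom-admission"
  "system:serviceaccount:azureml:default"
  "system:serviceaccount:azureml:prom-operator"
  "system:serviceaccount:azureml:csi-blob-node-sa"
  "system:serviceaccount:azureml:csi-blob-controller-sa"
  "system:serviceaccount:azureml:load-amlarc-selinux-policy-sa"
  "system:serviceaccount:azureml:azureml-fe"
  "system:serviceaccount:azureml:prom-prometheus"
  "system:serviceaccount:${COMPUTE_NAMESPACE}:default"
)

# Print one entry per line, in the YAML list form the users: section expects.
printf -- '- %s\n' "${accounts[@]}"
```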

Note

{KUBERNETES-COMPUTE-NAMESPACE} is the namespace of the Kubernetes compute specified during compute attach, which defaults to default. Skip this setting if the namespace is default.

Deploy Azure Machine Learning extension

Azure Arc-enabled Kubernetes has a cluster extension functionality that enables you to install various agents, including Azure Policy definitions, monitoring, machine learning, and many others. Azure Machine Learning requires the Microsoft.AzureML.Kubernetes cluster extension to deploy the Azure Machine Learning agent on the Kubernetes cluster. Once the Azure Machine Learning extension is installed, you can attach the cluster to an Azure Machine Learning workspace and use it for training, real-time inferencing, or both.

Tip

Training-only clusters also support batch inferencing as part of Azure Machine Learning pipelines.

Use the k8s-extension Azure CLI extension create command to deploy the Azure Machine Learning extension to your Azure Arc-enabled Kubernetes cluster.

Important

Set the --cluster-type parameter to managedClusters to deploy the Azure Machine Learning extension to AKS clusters.

The following configuration settings are available to be used for different Azure Machine Learning extension deployment scenarios.

You can use --config or --config-protected to specify a list of key-value pairs for Azure Machine Learning deployment configurations.

Tip

Set the openshift parameter to True to deploy the Azure Machine Learning extension to ARO and OCP Kubernetes clusters.

| Configuration setting key name | Description | Training | Inference | Training and inference |
| --- | --- | --- | --- | --- |
| enableTraining | True or False, default False. Must be set to True for AzureML extension deployment with Machine Learning model training support. | Required | N/A | Required |
| enableInference | True or False, default False. Must be set to True for AzureML extension deployment with Machine Learning inference support. | N/A | Required | Required |
| allowInsecureConnections | True or False, default False. Must be set to True for AzureML extension deployment with HTTP endpoint support for inference, when sslCertPemFile and sslKeyPemFile are not provided. | N/A | Optional | Optional |
| privateEndpointNodeport | True or False, default False. Must be set to True for AzureML extension deployment with Machine Learning inference private endpoint support using serviceType nodePort. | N/A | Optional | Optional |
| privateEndpointILB | True or False, default False. Must be set to True for AzureML extension deployment with Machine Learning inference private endpoint support using serviceType internal load balancer. | N/A | Optional | Optional |
| sslSecret | The Kubernetes secret under the azureml namespace that stores cert.pem (PEM-encoded SSL cert) and key.pem (PEM-encoded SSL key). Required for AzureML extension deployment with HTTPS endpoint support for inference, when allowInsecureConnections is set to False. Use this config or provide static cert and key file paths in configuration protected settings. | N/A | Optional | Optional |
| sslCname | An SSL CNAME to use if enabling SSL validation on the cluster. | N/A | Optional | Optional |
| inferenceLoadBalancerHA | True or False, default True. By default, the AzureML extension deploys three ingress controller replicas for high availability, which requires at least three worker nodes in the cluster. Set this to False if you have fewer than three worker nodes and want to deploy the AzureML extension for development and testing only; in that case it deploys a single ingress controller replica. | N/A | Optional | Optional |
| openshift | True or False, default False. Set to True if you deploy the AzureML extension on an ARO or OCP cluster. The deployment process automatically compiles a policy package and loads it on each node so that AzureML services can operate properly. | Optional | Optional | Optional |
| nodeSelector | Set node selectors so that the extension components and the training/inference workloads are only deployed to nodes with all specified selectors. Usage: nodeSelector.key=value; multiple selectors are supported. Example: nodeSelector.node-purpose=worker nodeSelector.node-region=eastus | Optional | Optional | Optional |
| installNvidiaDevicePlugin | True or False, default True. The Nvidia Device Plugin is required for ML workloads on Nvidia GPU hardware. By default, the AzureML extension deployment installs the Nvidia Device Plugin regardless of whether the Kubernetes cluster has GPU hardware. Set this to False if the Nvidia Device Plugin installation is not required (either it's already installed or you have no plan to use GPUs for workloads). | Optional | Optional | Optional |
| reuseExistingPromOp | True or False, default False. The AzureML extension needs the Prometheus operator to manage Prometheus. Set to True to reuse the existing Prometheus operator. | Optional | Optional | Optional |
| logAnalyticsWS | True or False, default False. The AzureML extension integrates with Azure Log Analytics workspace to provide log viewing and analysis capability through the workspace. This setting must be explicitly set to True if you want to use this capability. Log Analytics workspace costs may apply. | Optional | Optional | Optional |

| Configuration protected setting key name | Description | Training | Inference | Training and inference |
| --- | --- | --- | --- | --- |
| sslCertPemFile, sslKeyPemFile | Paths to the SSL certificate and key files (PEM-encoded). Required for AzureML extension deployment with HTTPS endpoint support for inference, when allowInsecureConnections is set to False. | N/A | Optional | Optional |
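As one way to satisfy the sslSecret setting above, you can create a Kubernetes secret in the azureml namespace holding the cert.pem and key.pem entries the table describes. A sketch, where the secret name my-ssl-secret and the local PEM file paths are placeholder values:

```shell
# Create the azureml namespace if the extension hasn't created it yet.
kubectl create namespace azureml --dry-run=client -o yaml | kubectl apply -f -

# Store the PEM-encoded cert and key under the cert.pem / key.pem keys
# described for the sslSecret configuration setting.
# my-ssl-secret and the ./cert.pem, ./key.pem paths are placeholders.
kubectl create secret generic my-ssl-secret \
  --namespace azureml \
  --from-file=cert.pem=./cert.pem \
  --from-file=key.pem=./key.pem
```

You would then pass sslSecret=my-ssl-secret in the --config settings instead of providing sslCertPemFile and sslKeyPemFile.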

Warning

If the Nvidia Device Plugin is already installed in your cluster, reinstalling it may result in an extension installation error. Set installNvidiaDevicePlugin to False to prevent deployment errors.

By default, the deployed Kubernetes deployment resources are deployed to one or more randomly selected nodes on the cluster, and daemonset resources are deployed to all nodes. If you want to restrict the extension deployment to specific nodes, use the nodeSelector configuration setting.
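Node selectors are passed as repeated key-value pairs in --config. A sketch that assembles a create command with two selectors and echoes it for review before you run it (the selector keys node-purpose and node-region are illustrative, not required names):

```shell
# Illustrative selector values; adjust to match your own node labels.
CONFIG_ARGS="nodeSelector.node-purpose=worker nodeSelector.node-region=eastus"

# Compose the deployment command; echoed here for review before running.
CMD="az k8s-extension create --name arcml-extension \
  --extension-type Microsoft.AzureML.Kubernetes \
  --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> \
  --resource-group <resource-group> --scope cluster \
  --config enableTraining=True ${CONFIG_ARGS}"
echo "${CMD}"
```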

Deploy extension for training workloads

Use the following Azure CLI command to deploy the Azure Machine Learning extension and enable training workloads on your Kubernetes cluster:

az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False

Deploy extension for real-time inferencing workloads

Depending on your network setup, Kubernetes distribution variant, and where your Kubernetes cluster is hosted (on-premises or in the cloud), choose one of the following options to deploy the Azure Machine Learning extension and enable inferencing workloads on your Kubernetes cluster.

Public endpoints support with public load balancer

  • HTTPS

    az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableInference=True sslCname=<cname> --config-protected sslCertPemFile=<path-to-the-SSL-cert-PEM-file> sslKeyPemFile=<path-to-the-SSL-key-PEM-file> --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
    
  • HTTP

    Warning

    Public HTTP endpoints support with public load balancer is the least secure way of deploying the Azure Machine Learning extension for real-time inferencing scenarios and is therefore NOT recommended.

    az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableInference=True allowInsecureConnections=True --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
    

Private endpoints support with internal load balancer

  • HTTPS

    az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableInference=True privateEndpointILB=True sslCname=<cname> --config-protected sslCertPemFile=<path-to-the-SSL-cert-PEM-file> sslKeyPemFile=<path-to-the-SSL-key-PEM-file> --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
    
  • HTTP

    az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableInference=True privateEndpointILB=True allowInsecureConnections=True --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
    

Endpoints support with NodePort

Using a NodePort gives you the freedom to set up your own load-balancing solution, to configure environments that are not fully supported by Kubernetes, or even to expose one or more nodes' IPs directly.

When you deploy with the NodePort service, the scoring URL (or swagger URL) uses a node IP and the assigned node port (for example, http://<NodeIP>:<NodePort>/<scoring_path>) and remains unchanged even if that node becomes unavailable. In that case, you can replace the node IP with any other node's IP.

  • HTTPS

    az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group> --scope cluster --config enableInference=True privateEndpointNodeport=True sslCname=<cname> --config-protected sslCertPemFile=<path-to-the-SSL-cert-PEM-file> sslKeyPemFile=<path-to-the-SSL-key-PEM-file> --auto-upgrade-minor-version False
    
  • HTTP

    az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableInference=True privateEndpointNodeport=True allowInsecureConnections=True --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
    

Deploy extension for training and inferencing workloads

Use the following Azure CLI command to deploy the Azure Machine Learning extension and enable cluster real-time inferencing, batch-inferencing, and training workloads on your Kubernetes cluster.

az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableTraining=True enableInference=True sslCname=<cname> --config-protected sslCertPemFile=<path-to-the-SSL-cert-PEM-file> sslKeyPemFile=<path-to-the-SSL-key-PEM-file> --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False

Resources created during deployment

Once the Azure Machine Learning extension is deployed, the following resources are created in Azure as well as in your Kubernetes cluster, depending on the workloads you run on your cluster.

| Resource name | Resource type | Training | Inference | Training and inference | Description |
| --- | --- | --- | --- | --- | --- |
| Azure Service Bus | Azure resource | ✓ | ✓ | ✓ | Used by the gateway to regularly sync job and cluster status to Azure Machine Learning services. |
| Azure Relay | Azure resource | ✓ | ✓ | ✓ | Routes traffic from Azure Machine Learning services to the Kubernetes cluster. |
| aml-operator | Kubernetes deployment | ✓ | N/A | ✓ | Manages the lifecycle of training jobs. |
| {EXTENSION-NAME}-kube-state-metrics | Kubernetes deployment | ✓ | ✓ | ✓ | Exports cluster-related metrics to Prometheus. |
| {EXTENSION-NAME}-prometheus-operator | Kubernetes deployment | ✓ | ✓ | ✓ | Provides Kubernetes-native deployment and management of Prometheus and related monitoring components. |
| amlarc-identity-controller | Kubernetes deployment | N/A | ✓ | ✓ | Requests and renews Blob/Azure Container Registry tokens with a managed identity for infrastructure and user containers. |
| amlarc-identity-proxy | Kubernetes deployment | N/A | ✓ | ✓ | Requests and renews Blob/Azure Container Registry tokens with a managed identity for infrastructure and user containers. |
| azureml-fe | Kubernetes deployment | N/A | ✓ | ✓ | The front-end component that routes incoming inference requests to deployed services. |
| inference-operator-controller-manager | Kubernetes deployment | N/A | ✓ | ✓ | Manages the lifecycle of inference endpoints. |
| metrics-controller-manager | Kubernetes deployment | ✓ | ✓ | ✓ | Manages the configuration for Prometheus. |
| relayserver | Kubernetes deployment | ✓ | ✓ | ✓ | Passes the job spec from Azure Machine Learning services to the Kubernetes cluster. |
| cluster-status-reporter | Kubernetes deployment | ✓ | ✓ | ✓ | Gathers node and resource information and uploads it to Azure Machine Learning services. |
| nfd-master | Kubernetes deployment | ✓ | N/A | ✓ | Node feature discovery. |
| gateway | Kubernetes deployment | ✓ | ✓ | ✓ | Sends node and cluster resource information to Azure Machine Learning services. |
| csi-blob-controller | Kubernetes deployment | N/A | ✓ | ✓ | Azure Blob Storage Container Storage Interface (CSI) driver. |
| csi-blob-node | Kubernetes daemonset | N/A | ✓ | ✓ | Azure Blob Storage Container Storage Interface (CSI) driver. |
| fluent-bit | Kubernetes daemonset | ✓ | ✓ | ✓ | Gathers the infrastructure components' logs. |
| k8s-host-device-plugin-daemonset | Kubernetes daemonset | ✓ | ✓ | ✓ | Exposes FUSE to pods on each node. |
| nfd-worker | Kubernetes daemonset | ✓ | N/A | ✓ | Node feature discovery. |
| prometheus-prom-prometheus | Kubernetes statefulset | ✓ | ✓ | ✓ | Gathers and sends job metrics to Azure. |
| frameworkcontroller | Kubernetes statefulset | ✓ | N/A | ✓ | Manages the lifecycle of Azure Machine Learning training pods. |
| alertmanager | Kubernetes statefulset | ✓ | N/A | ✓ | Handles alerts sent by client applications such as the Prometheus server. |

Important

Azure Service Bus and Azure Relay resources are under the same resource group as the Arc cluster resource. These resources are used to communicate with the Kubernetes cluster and modifying them will break attached compute targets.

Note

{EXTENSION-NAME} is the extension name specified with the --name parameter of the az k8s-extension create Azure CLI command.

Verify your AzureML extension deployment

az k8s-extension show --name arcml-extension --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group>

In the response, look for "name": "arcml-extension" and "installState": "Installed". Note that it might show "installState": "Pending" for the first few minutes.

When the installState shows Installed, run the following command on your machine with the kubeconfig file pointed to your cluster to check that all pods under azureml namespace are in Running state:

kubectl get pods -n azureml
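Rather than re-running the show command by hand, you can poll until the extension leaves the Pending state. A sketch, assuming an authenticated Azure CLI session and the placeholder cluster and resource group names used throughout this article:

```shell
# Poll the extension install state every 30 seconds until it is "Installed".
while true; do
  state=$(az k8s-extension show --name arcml-extension \
    --cluster-type connectedClusters \
    --cluster-name <your-connected-cluster-name> \
    --resource-group <resource-group> \
    --query installState --output tsv)
  echo "installState: ${state}"
  [ "${state}" = "Installed" ] && break
  sleep 30
done

# Then confirm all extension pods under the azureml namespace are running.
kubectl get pods -n azureml
```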

Update Azure Machine Learning extension

Use the k8s-extension update CLI command to update the mutable properties of the Azure Machine Learning extension. For more information, see the k8s-extension update CLI command documentation.

  1. Azure Arc supports updating --auto-upgrade-minor-version, --version, --configuration-settings, and --configuration-protected-settings.
  2. For configurationSettings, only the settings that require an update need to be provided. The provided values are merged with, and overwrite, the existing settings.
  3. For configurationProtectedSettings, ALL settings must be provided. If any setting is omitted, it's considered obsolete and deleted.
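As a sketch of rule 2, here is one way to enable inference on an extension originally deployed for training only; per rule 3, both protected SSL settings are supplied together (the extension name and placeholders match the deployment examples earlier in this article):

```shell
# Enable inference on an existing arcml-extension deployment.
# Protected settings must be passed in full, not partially.
az k8s-extension update --name arcml-extension \
  --extension-type Microsoft.AzureML.Kubernetes \
  --cluster-type connectedClusters \
  --cluster-name <your-connected-cluster-name> \
  --resource-group <resource-group> \
  --config enableInference=True sslCname=<cname> \
  --config-protected sslCertPemFile=<path-to-the-SSL-cert-PEM-file> sslKeyPemFile=<path-to-the-SSL-key-PEM-file>
```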

Important

Don't update the following configurations if you have active training workloads or real-time inference endpoints. Otherwise, the training jobs will be impacted and the endpoints will become unavailable.

  • enableTraining from True to False
  • installNvidiaDevicePlugin from True to False when using GPU.
  • nodeSelector. The update operation can't remove existing nodeSelectors. It can only update existing ones or add new ones.

Don't update the following configurations if you have active real-time inference endpoints; otherwise, the endpoints will become unavailable.

  • allowInsecureConnections
  • privateEndpointNodeport
  • privateEndpointILB
  • To update logAnalyticsWS from True to False, provide all original configurationProtectedSettings. Otherwise, those settings are considered obsolete and deleted.

Delete Azure Machine Learning extension

Use the k8s-extension delete CLI command to delete the Azure Machine Learning extension.

It takes around 10 minutes to delete all components deployed to the Kubernetes cluster. Run kubectl get pods -n azureml to check if all components were deleted.
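A sketch of the delete command, using the extension name, cluster, and resource group placeholders from the earlier deployment examples:

```shell
# Remove the Azure Machine Learning extension from the connected cluster.
az k8s-extension delete --name arcml-extension \
  --cluster-type connectedClusters \
  --cluster-name <your-connected-cluster-name> \
  --resource-group <resource-group>
```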

Attach Arc Cluster

Prerequisite

By default, an Azure Machine Learning workspace uses a system-assigned managed identity to access Azure Machine Learning resources. If this default setting is applied, no further steps are needed.

Managed Identity in workspace

Otherwise, if a user-assigned managed identity was specified during Azure Machine Learning workspace creation, the following role assignments must be granted to that identity manually before attaching the compute.

| Azure resource name | Role to be assigned |
| --- | --- |
| Azure Service Bus | Azure Service Bus Data Owner |
| Azure Relay | Azure Relay Owner |
| Azure Arc-enabled Kubernetes | Reader |

The Azure Service Bus and Azure Relay resources are created under the same resource group as the Arc cluster.

Attaching an Azure Arc-enabled Kubernetes cluster makes it available to your workspace for training.

  1. Navigate to Azure Machine Learning studio.

  2. Under Manage, select Compute.

  3. Select the Attached computes tab.

  4. Select +New > Kubernetes (preview).

    Attach Kubernetes cluster

  5. Enter a compute name and select your Azure Arc-enabled Kubernetes cluster from the dropdown.

    • (Optional) Enter the Kubernetes namespace, which defaults to default. All machine learning workloads are sent to the specified Kubernetes namespace in the cluster.

    • (Optional) Assign a system-assigned or user-assigned managed identity. Managed identities eliminate the need for developers to manage credentials. For more information, see the managed identities overview.

    Configure Kubernetes cluster

  6. Select Attach.

    In the Attached computes tab, the initial state of your cluster is Creating. When the cluster is successfully attached, the state changes to Succeeded. Otherwise, the state changes to Failed.

    Provision resources

Next steps