Create and attach an Azure Kubernetes Service cluster

APPLIES TO: Python SDK azureml v1

APPLIES TO: Azure CLI ml extension v1

Azure Machine Learning can deploy trained machine learning models to Azure Kubernetes Service. However, you must first either create an Azure Kubernetes Service (AKS) cluster from your Azure ML workspace, or attach an existing AKS cluster. This article provides information on both creating and attaching a cluster.

Prerequisites

Limitations

  • If you need a Standard Load Balancer (SLB) deployed in your cluster instead of a Basic Load Balancer (BLB), create a cluster in the AKS portal/CLI/SDK and then attach it to the AML workspace.

  • If you have an Azure Policy that restricts the creation of Public IP addresses, then AKS cluster creation will fail. AKS requires a Public IP for egress traffic. The egress traffic article also provides guidance to lock down egress traffic from the cluster through the Public IP, except for a few fully qualified domain names. There are 2 ways to enable a Public IP:

    • The cluster can use the Public IP created by default with the BLB or SLB, or
    • The cluster can be created without a Public IP, and a Public IP is then configured on a firewall with a user-defined route. For more information, see Customize cluster egress with a user-defined-route.

    The AML control plane does not talk to this Public IP. It talks to the AKS control plane for deployments.

  • To attach an AKS cluster, the service principal/user performing the operation must be assigned the Owner or Contributor Azure role-based access control (Azure RBAC) role on the Azure resource group that contains the cluster. The service principal/user must also be assigned the Azure Kubernetes Service Cluster Admin Role on the cluster.

  • If you attach an AKS cluster that has an authorized IP range enabled for access to the API server, enable the AML control plane IP ranges for the AKS cluster. The AML control plane is deployed across paired regions and deploys inference pods on the AKS cluster. Without access to the API server, the inference pods cannot be deployed. Use the IP ranges for both the paired regions when enabling the IP ranges in an AKS cluster.

    Authorized IP ranges work only with Standard Load Balancer.

  • If you want to use a private AKS cluster (using Azure Private Link), you must create the cluster first, and then attach it to the workspace. For more information, see Create a private Azure Kubernetes Service cluster.

  • Using a public fully qualified domain name (FQDN) with a private AKS cluster is not supported with Azure Machine Learning.

  • The compute name for the AKS cluster MUST be unique within your Azure ML workspace. It can include letters, digits and dashes. It must start with a letter, end with a letter or digit, and be between 3 and 24 characters in length.

  • If you want to deploy models to GPU nodes or FPGA nodes (or any specific SKU), then you must create a cluster with the specific SKU. There is no support for creating a secondary node pool in an existing cluster and deploying models in the secondary node pool.

  • When creating or attaching a cluster, you can select whether to create the cluster for dev-test or production. If you want to create an AKS cluster for development, validation, and testing instead of production, set the cluster purpose to dev-test. If you do not specify the cluster purpose, a production cluster is created.

    Important

    A dev-test cluster is not suitable for production level traffic and may increase inference times. Dev/test clusters also do not guarantee fault tolerance.

  • When creating or attaching a cluster, if the cluster will be used for production, then it must contain at least 3 nodes. For a dev-test cluster, it must contain at least 1 node.

  • The Azure Machine Learning SDK does not support scaling an AKS cluster. To scale the nodes in the cluster, use the UI for your AKS cluster in the Azure Machine Learning studio. You can only change the node count, not the VM size of the cluster. For more information on scaling the nodes in an AKS cluster, see the following articles:

  • Do not directly update the cluster by using a YAML configuration. While Azure Kubernetes Service supports updates via YAML configuration, Azure Machine Learning deployments will override your changes. The only two YAML fields that will not be overwritten are request limits and cpu and memory.

  • Creating an AKS cluster using the Azure Machine Learning studio UI, SDK, or CLI extension is not idempotent. Attempting to create the resource again will result in an error that a cluster with the same name already exists.

    • Using an Azure Resource Manager template and the Microsoft.MachineLearningServices/workspaces/computes resource to create an AKS cluster is also not idempotent. If you attempt to use the template again to update an already existing resource, you will receive the same error.
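
The naming and node-count rules in the list above can be checked locally before you call the service. The following is a minimal sketch that uses only the Python standard library. The helper names is_valid_compute_name and meets_minimum_nodes are hypothetical and are not part of the Azure Machine Learning SDK, and the check assumes that "letters" means ASCII letters. Uniqueness within the workspace cannot be verified locally.

```python
import re

# Starts with a letter, ends with a letter or digit, 3 to 24 characters,
# containing only letters, digits, and dashes. ASCII letters are an assumption.
_COMPUTE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9-]{1,22}[A-Za-z0-9]")

# Minimum node counts stated in the limitations above.
_MIN_NODES = {"production": 3, "dev-test": 1}


def is_valid_compute_name(name: str) -> bool:
    """Check the compute name against the character and length rules."""
    return _COMPUTE_NAME.fullmatch(name) is not None


def meets_minimum_nodes(node_count: int, cluster_purpose: str = "production") -> bool:
    """Check the node count against the minimum for the cluster purpose."""
    return node_count >= _MIN_NODES[cluster_purpose]
```

For example, is_valid_compute_name("myaks") passes, while "1aks" (starts with a digit) and "myaks-" (ends with a dash) do not.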

Azure Kubernetes Service version

Azure Kubernetes Service allows you to create a cluster using a variety of Kubernetes versions. For more information on available versions, see supported Kubernetes versions in Azure Kubernetes Service.

When creating an Azure Kubernetes Service cluster using one of the following methods, you do not have a choice in the version of the cluster that is created:

  • Azure Machine Learning studio, or the Azure Machine Learning section of the Azure portal.
  • Machine Learning extension for Azure CLI.
  • Azure Machine Learning SDK.

These methods of creating an AKS cluster use the default version of the cluster. The default version changes over time as new Kubernetes versions become available.

When attaching an existing AKS cluster, we support all currently supported AKS versions.

Important

Azure Kubernetes Service uses the Blobfuse FlexVolume driver for versions <=1.16 and the Blob CSI driver for versions >=1.17. Therefore, after a cluster upgrade, redeploy or update the web service so that it uses the correct blobfuse method for the cluster version.
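
The version boundary described above can be expressed as a small check. This is an illustrative sketch only: the function names and the returned labels ("blobfuse-flexvolume", "blob-csi") are hypothetical and do not correspond to any SDK or AKS identifier. The second helper flags only upgrades that cross the driver boundary; the guidance above still applies to upgrades in general.

```python
import re


def blob_driver_for(kubernetes_version: str) -> str:
    """Return a label for the driver family used at a given Kubernetes version."""
    match = re.match(r"(\d+)\.(\d+)", kubernetes_version)
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version: {kubernetes_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Versions <= 1.16 use Blobfuse FlexVolume; versions >= 1.17 use Blob CSI.
    return "blobfuse-flexvolume" if (major, minor) <= (1, 16) else "blob-csi"


def driver_changes_on_upgrade(version_before: str, version_after: str) -> bool:
    """True when an upgrade moves the cluster from one driver family to the other."""
    return blob_driver_for(version_before) != blob_driver_for(version_after)
```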

Note

There may be edge cases where you have an older cluster that is no longer supported. In this case, the attach operation will return an error and list the currently supported versions.

You can attach preview versions. Preview functionality is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. Support for using preview versions may be limited. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Available and default versions

To find the available and default AKS versions, use the Azure CLI command az aks get-versions. For example, the following command returns the versions available in the West US region:

az aks get-versions -l westus -o table

The output of this command is similar to the following text:

KubernetesVersion    Upgrades
-------------------  ----------------------------------------
1.18.6(preview)      None available
1.18.4(preview)      1.18.6(preview)
1.17.9               1.18.4(preview), 1.18.6(preview)
1.17.7               1.17.9, 1.18.4(preview), 1.18.6(preview)
1.16.13              1.17.7, 1.17.9
1.16.10              1.16.13, 1.17.7, 1.17.9
1.15.12              1.16.10, 1.16.13
1.15.11              1.15.12, 1.16.10, 1.16.13

To find the default version that is used when creating a cluster through Azure Machine Learning, you can use the --query parameter to select the default version:

az aks get-versions -l westus --query "orchestrators[?default == `true`].orchestratorVersion" -o table

The output of this command is similar to the following text:

Result
--------
1.16.13

If you'd like to programmatically check the available versions, use the Container Service Client - List Orchestrators REST API. To find the available versions, look at the entries where orchestratorType is Kubernetes. The associated orchestratorVersion entries contain the available versions that can be attached to your workspace.

To find the default version that is used when creating a cluster through Azure Machine Learning, find the entry where orchestratorType is Kubernetes and default is true. The associated orchestratorVersion value is the default version. The following JSON snippet shows an example entry:

...
 {
        "orchestratorType": "Kubernetes",
        "orchestratorVersion": "1.16.13",
        "default": true,
        "upgrades": [
          {
            "orchestratorType": "",
            "orchestratorVersion": "1.17.7",
            "isPreview": false
          }
        ]
      },
...
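
The lookup described above can be sketched in a few lines of Python. This example assumes you have already extracted the list of orchestrator entries from the REST response; the function name kubernetes_versions is hypothetical, and the second sample entry is made up for illustration (only the first mirrors the snippet above).

```python
import json


def kubernetes_versions(orchestrators):
    """Return (available_versions, default_version) for Kubernetes entries."""
    available = []
    default = None
    for entry in orchestrators:
        if entry.get("orchestratorType") != "Kubernetes":
            continue
        version = entry["orchestratorVersion"]
        available.append(version)
        # The default entry is the one whose "default" field is true.
        if entry.get("default") is True:
            default = version
    return available, default


# Sample data: the first entry follows the snippet above, the second is illustrative.
sample = json.loads("""
[
  {"orchestratorType": "Kubernetes", "orchestratorVersion": "1.16.13", "default": true,
   "upgrades": [{"orchestratorType": "", "orchestratorVersion": "1.17.7", "isPreview": false}]},
  {"orchestratorType": "Kubernetes", "orchestratorVersion": "1.17.7", "upgrades": []}
]
""")

available, default = kubernetes_versions(sample)
print(available)  # ['1.16.13', '1.17.7']
print(default)    # 1.16.13
```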

Create a new AKS cluster

Time estimate: Approximately 10 minutes.

Creating or attaching an AKS cluster is a one-time process for your workspace. You can reuse this cluster for multiple deployments. If you delete the cluster or the resource group that contains it, you must create a new cluster the next time you need to deploy. You can have multiple AKS clusters attached to your workspace.

The following example demonstrates how to create a new AKS cluster using the SDK and CLI:

APPLIES TO: Python SDK azureml v1

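from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget

# Load the workspace (assumes a config.json for your workspace is available locally)
ws = Workspace.from_config()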

# Use the default configuration (you can also provide parameters to customize this).
# For example, to create a dev/test cluster, use:
# prov_config = AksCompute.provisioning_configuration(cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
prov_config = AksCompute.provisioning_configuration()

# Example configuration to use an existing virtual network
# prov_config.vnet_name = "mynetwork"
# prov_config.vnet_resourcegroup_name = "mygroup"
# prov_config.subnet_name = "default"
# prov_config.service_cidr = "10.0.0.0/16"
# prov_config.dns_service_ip = "10.0.0.10"
# prov_config.docker_bridge_cidr = "172.17.0.1/16"

aks_name = 'myaks'
# Create the cluster
aks_target = ComputeTarget.create(workspace = ws,
                                    name = aks_name,
                                    provisioning_configuration = prov_config)

# Wait for the create process to complete
aks_target.wait_for_completion(show_output = True)

For more information on the classes, methods, and parameters used in this example, see the following reference documents:

Attach an existing AKS cluster

Time estimate: Approximately 5 minutes.

If you already have an AKS cluster in your Azure subscription, you can use it with your workspace.

Tip

The existing AKS cluster can be in an Azure region other than that of your Azure Machine Learning workspace.

Warning

Do not create multiple, simultaneous attachments to the same AKS cluster from your workspace, for example by attaching one AKS cluster to a workspace under two different names. Each new attachment will break the previous existing attachment(s).

If you want to re-attach an AKS cluster, for example to change TLS or other cluster configuration setting, you must first remove the existing attachment by using AksCompute.detach().

For more information on creating an AKS cluster using the Azure CLI or portal, see the following articles:

The following example demonstrates how to attach an existing AKS cluster to your workspace:

APPLIES TO: Python SDK azureml v1

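from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget

# Load the workspace (assumes a config.json for your workspace is available locally)
ws = Workspace.from_config()

# Set the resource group that contains the AKS cluster and the cluster name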
resource_group = 'myresourcegroup'
cluster_name = 'myexistingcluster'

# Attach the cluster to your workspace. If the cluster has fewer than 12 virtual CPUs, use the following instead:
# attach_config = AksCompute.attach_configuration(resource_group = resource_group,
#                                         cluster_name = cluster_name,
#                                         cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
attach_config = AksCompute.attach_configuration(resource_group = resource_group,
                                         cluster_name = cluster_name)
aks_target = ComputeTarget.attach(ws, 'myaks', attach_config)

# Wait for the attach process to complete
aks_target.wait_for_completion(show_output = True)

For more information on the classes, methods, and parameters used in this example, see the following reference documents:

Create or attach an AKS cluster with TLS termination

When you create or attach an AKS cluster, you can enable TLS termination with the AksCompute.provisioning_configuration() and AksCompute.attach_configuration() configuration objects. Both methods return a configuration object that has an enable_ssl method, which you can use to enable TLS.

The following example shows how to enable TLS termination with automatic TLS certificate generation and configuration, using a Microsoft certificate under the hood.

APPLIES TO: Python SDK azureml v1

from azureml.core.compute import AksCompute, ComputeTarget

# Enable TLS termination when you create an AKS cluster by using the
# enable_ssl method of the provisioning configuration object
provisioning_config = AksCompute.provisioning_configuration()

# Leaf domain label generates a name using the formula
# "<leaf-domain-label>######.<azure-region>.cloudapp.azure.com"
# where "######" is a random series of characters
provisioning_config.enable_ssl(leaf_domain_label = "contoso")

# Enable TLS termination when you attach an AKS cluster by using the
# enable_ssl method of the attach configuration object
attach_config = AksCompute.attach_configuration(resource_group = 'myresourcegroup',
                                                cluster_name = 'myexistingcluster')

# Leaf domain label generates a name using the formula
# "<leaf-domain-label>######.<azure-region>.cloudapp.azure.com"
# where "######" is a random series of characters
attach_config.enable_ssl(leaf_domain_label = "contoso")
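
If you need to recognize a generated name later, for example in a test, the formula in the comments above can be turned into a pattern. This is a sketch under a stated assumption: it treats "######" as exactly six alphanumeric characters, which the formula suggests but does not guarantee. The function name matches_generated_fqdn is hypothetical.

```python
import re


def matches_generated_fqdn(fqdn: str, leaf_domain_label: str, azure_region: str) -> bool:
    """Check an FQDN against "<leaf-domain-label>######.<azure-region>.cloudapp.azure.com"."""
    pattern = (
        re.escape(leaf_domain_label)
        + r"[A-Za-z0-9]{6}\."  # assumption: six random alphanumeric characters
        + re.escape(azure_region)
        + r"\.cloudapp\.azure\.com"
    )
    return re.fullmatch(pattern, fqdn, flags=re.IGNORECASE) is not None
```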


The following example shows how to enable TLS termination with a custom certificate and a custom domain name. With a custom domain and certificate, you must update your DNS record to point to the IP address of the scoring endpoint. For more information, see Update your DNS.

APPLIES TO: Python SDK azureml v1

from azureml.core.compute import AksCompute, ComputeTarget

# provisioning_config and attach_config are the configuration objects returned by
# AksCompute.provisioning_configuration() and AksCompute.attach_configuration()

# Enable TLS termination with custom certificate and custom domain when creating an AKS cluster
provisioning_config.enable_ssl(ssl_cert_pem_file="cert.pem",
                               ssl_key_pem_file="key.pem",
                               ssl_cname="www.contoso.com")

# Enable TLS termination with custom certificate and custom domain when attaching an AKS cluster
attach_config.enable_ssl(ssl_cert_pem_file="cert.pem",
                         ssl_key_pem_file="key.pem",
                         ssl_cname="www.contoso.com")


Note

For more information about how to secure model deployment on an AKS cluster, see Use TLS to secure a web service through Azure Machine Learning.

Create or attach an AKS cluster to use Internal Load Balancer with private IP

When you create or attach an AKS cluster, you can configure the cluster to use an Internal Load Balancer. With an Internal Load Balancer, scoring endpoints for your deployments to AKS will use a private IP within the virtual network. The following code snippets show how to configure an Internal Load Balancer for an AKS cluster.

APPLIES TO: Python SDK azureml v1

To create an AKS cluster that uses an Internal Load Balancer, use the load_balancer_type and load_balancer_subnet parameters:

from azureml.core.compute.aks import AksUpdateConfiguration
from azureml.core.compute import AksCompute, ComputeTarget

# Change to the name of the subnet that contains AKS
subnet_name = "default"
# When you create an AKS cluster, you can specify Internal Load Balancer to be created with provisioning_config object
provisioning_config = AksCompute.provisioning_configuration(load_balancer_type = 'InternalLoadBalancer', load_balancer_subnet = subnet_name)

# Create the cluster (ws is your Workspace object; aks_name is the compute name, for example 'myaks')
aks_target = ComputeTarget.create(workspace = ws,
                                name = aks_name,
                                provisioning_configuration = provisioning_config)

# Wait for the create process to complete
aks_target.wait_for_completion(show_output = True)

Important

If your AKS cluster is configured with an Internal Load Balancer, using a Microsoft-provided certificate is not supported, and you must use a custom certificate to enable TLS.

Note

For more information about how to secure an inferencing environment, see Secure an Azure Machine Learning Inferencing Environment.

Detach an AKS cluster

To detach a cluster from your workspace, use one of the following methods:

Warning

Using the Azure Machine Learning studio, SDK, or the Azure CLI extension for machine learning to detach an AKS cluster does not delete the AKS cluster. To delete the cluster, see Use the Azure CLI with AKS.

APPLIES TO: Python SDK azureml v1

aks_target.detach()

Troubleshooting

Update the cluster

Updates to Azure Machine Learning components installed in an Azure Kubernetes Service cluster must be manually applied.

You can apply these updates by detaching the cluster from the Azure Machine Learning workspace and reattaching the cluster to the workspace.

APPLIES TO: Python SDK azureml v1

compute_target = ComputeTarget(workspace=ws, name=clusterWorkspaceName)
compute_target.detach()
compute_target.wait_for_completion(show_output=True)

Before you can re-attach the cluster to your workspace, you need to first delete any azureml-fe related resources. If there is no active service in the cluster, you can delete your azureml-fe related resources with the following code.

kubectl delete sa azureml-fe
kubectl delete clusterrole azureml-fe-role
kubectl delete clusterrolebinding azureml-fe-binding
kubectl delete svc azureml-fe
kubectl delete svc azureml-fe-int-http
kubectl delete deploy azureml-fe
kubectl delete secret azuremlfessl
kubectl delete cm azuremlfeconfig
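
If you prefer to script this cleanup, the commands above can be driven from Python. The following is a minimal sketch, assuming kubectl is on your PATH and already configured for the target cluster. The names AZUREML_FE_RESOURCES, cleanup_commands, and run_cleanup are hypothetical. By default it only prints the commands (dry run) so that you can review them first.

```python
import subprocess

# (resource kind, resource name) pairs, taken from the kubectl commands above.
AZUREML_FE_RESOURCES = [
    ("sa", "azureml-fe"),
    ("clusterrole", "azureml-fe-role"),
    ("clusterrolebinding", "azureml-fe-binding"),
    ("svc", "azureml-fe"),
    ("svc", "azureml-fe-int-http"),
    ("deploy", "azureml-fe"),
    ("secret", "azuremlfessl"),
    ("cm", "azuremlfeconfig"),
]


def cleanup_commands():
    """Build one `kubectl delete` argument list per azureml-fe resource."""
    return [["kubectl", "delete", kind, name] for kind, name in AZUREML_FE_RESOURCES]


def run_cleanup(dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False."""
    for command in cleanup_commands():
        if dry_run:
            print(" ".join(command))
        else:
            # check=False: continue even if a resource has already been deleted.
            subprocess.run(command, check=False)


run_cleanup(dry_run=True)
```

Call run_cleanup(dry_run=False) only after confirming that there is no active service in the cluster.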

If TLS is enabled in the cluster, you will need to supply the TLS/SSL certificate and private key when reattaching the cluster.

APPLIES TO: Python SDK azureml v1

attach_config = AksCompute.attach_configuration(resource_group=resourceGroup, cluster_name=kubernetesClusterName)

# If SSL is enabled.
attach_config.enable_ssl(
    ssl_cert_pem_file="cert.pem",
    ssl_key_pem_file="key.pem",
    ssl_cname=sslCname)

attach_config.validate_configuration()

compute_target = ComputeTarget.attach(workspace=ws, name=clusterWorkspaceName, attach_configuration=attach_config)
compute_target.wait_for_completion(show_output=True)

If you no longer have the TLS/SSL certificate and private key, or you are using a certificate generated by Azure Machine Learning, you can retrieve the files prior to detaching the cluster by connecting to the cluster using kubectl and retrieving the secret azuremlfessl.

kubectl get secret/azuremlfessl -o yaml

Note

Kubernetes stores the secrets in Base64-encoded format. You will need to Base64-decode the cert.pem and key.pem components of the secrets prior to providing them to attach_config.enable_ssl.
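
The decoding step can be done with the Python standard library. This sketch assumes you saved the secret as JSON, for example with kubectl get secret/azuremlfessl -o json > secret.json, and that the secret's data map uses the keys cert.pem and key.pem as described in the note above. The function names are hypothetical.

```python
import base64
import json
from pathlib import Path


def decode_tls_secret(secret_json: str) -> dict:
    """Base64-decode every value in a Kubernetes Secret's `data` map."""
    secret = json.loads(secret_json)
    return {key: base64.b64decode(value) for key, value in secret.get("data", {}).items()}


def write_pem_files(secret_json: str, directory: str = ".") -> None:
    """Write the decoded cert.pem and key.pem files into `directory`."""
    decoded = decode_tls_secret(secret_json)
    # Key names are an assumption based on the note above.
    for name in ("cert.pem", "key.pem"):
        Path(directory, name).write_bytes(decoded[name])
```

After calling write_pem_files(Path("secret.json").read_text()), the resulting cert.pem and key.pem files can be passed to attach_config.enable_ssl. Treat key.pem as sensitive and remove it when you no longer need it.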

Webservice failures

Many webservice failures in AKS can be debugged by connecting to the cluster using kubectl. You can get the kubeconfig.json for an AKS cluster by running the following command:

APPLIES TO: Azure CLI ml extension v1

az aks get-credentials -g <rg> -n <aks cluster name>

After detaching the cluster, if there is no active service in the cluster, delete the azureml-fe related resources before attaching again:

kubectl delete sa azureml-fe
kubectl delete clusterrole azureml-fe-role
kubectl delete clusterrolebinding azureml-fe-binding
kubectl delete svc azureml-fe
kubectl delete svc azureml-fe-int-http
kubectl delete deploy azureml-fe
kubectl delete secret azuremlfessl
kubectl delete cm azuremlfeconfig

Load balancers should not have public IPs

When trying to create or attach an AKS cluster, you may receive a message that the request has been denied because "Load Balancers should not have public IPs". This message is returned when an administrator has applied a policy that prevents using an AKS cluster with a public IP address.

To resolve this problem, create/attach the cluster by using the load_balancer_type and load_balancer_subnet parameters. For more information, see Internal Load Balancer (private IP).

Next steps