Deploy a model to an Azure Kubernetes Service cluster

Learn how to use Azure Machine Learning to deploy a model as a web service on Azure Kubernetes Service (AKS). Azure Kubernetes Service is good for high-scale production deployments. Use Azure Kubernetes Service if you need one or more of the following capabilities:

  • Fast response time.
  • Autoscaling of the deployed service.
  • Hardware acceleration options such as GPU and field-programmable gate arrays (FPGA).

Important

Cluster scaling is not provided through the Azure Machine Learning SDK. For more information on scaling the nodes in an AKS cluster, see Scale the node count in an AKS cluster.

When deploying to Azure Kubernetes Service, you deploy to an AKS cluster that is connected to your workspace. There are two ways to connect an AKS cluster to your workspace:

  • Create the AKS cluster using the Azure Machine Learning SDK, the Machine Learning CLI, the Azure portal or workspace landing page (preview). This process automatically connects the cluster to the workspace.
  • Attach an existing AKS cluster to your Azure Machine Learning workspace. A cluster can be attached using the Azure Machine Learning SDK, Machine Learning CLI, or the Azure portal.

Important

The creation or attachment process is a one-time task. Once an AKS cluster is connected to the workspace, you can use it for deployments. You can detach or delete the AKS cluster if you no longer need it. Once detached or deleted, you will no longer be able to deploy to the cluster.

Prerequisites

Create a new AKS cluster

Time estimate: Approximately 20 minutes.

Creating or attaching an AKS cluster is a one-time process for your workspace. You can reuse this cluster for multiple deployments. If you delete the cluster or the resource group that contains it, you must create a new cluster the next time you need to deploy. You can have multiple AKS clusters attached to your workspace.

Tip

If you want to secure your AKS cluster using an Azure Virtual Network, you must create the virtual network first. For more information, see Secure experimentation and inference with Azure Virtual Network.

If you want to create an AKS cluster for development, validation, and testing instead of production, you can specify the cluster purpose as dev/test.

Warning

If you set cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST, the cluster that is created is not suitable for production-level traffic and may increase inference times. Dev/test clusters also do not guarantee fault tolerance. We recommend at least 2 virtual CPUs for dev/test clusters.

The following examples demonstrate how to create a new AKS cluster using the SDK and CLI:

Using the SDK

from azureml.core.compute import AksCompute, ComputeTarget

# Use the default configuration (you can also provide parameters to customize this).
# For example, to create a dev/test cluster, use:
# prov_config = AksCompute.provisioning_configuration(cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
prov_config = AksCompute.provisioning_configuration()

aks_name = 'myaks'
# Create the cluster
aks_target = ComputeTarget.create(workspace = ws,
                                  name = aks_name,
                                  provisioning_configuration = prov_config)

# Wait for the create process to complete
aks_target.wait_for_completion(show_output = True)

Important

For provisioning_configuration(), if you pick custom values for agent_count and vm_size, and cluster_purpose is not DEV_TEST, then you need to make sure agent_count multiplied by vm_size is greater than or equal to 12 virtual CPUs. For example, if you use a vm_size of "Standard_D3_v2", which has 4 virtual CPUs, then you should pick an agent_count of 3 or greater.
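The sizing rule above can be sketched as a quick check. This is a hypothetical helper for illustration only, not part of the SDK:

```python
def meets_vcpu_requirement(agent_count, vcpus_per_node, minimum=12):
    """Sizing rule for production (non-DEV_TEST) clusters: agent_count
    multiplied by the vCPUs per node must be at least 12."""
    return agent_count * vcpus_per_node >= minimum

# Three Standard_D3_v2 nodes (4 vCPUs each) meet the requirement.
print(meets_vcpu_requirement(3, 4))   # True (12 vCPUs)
print(meets_vcpu_requirement(2, 4))   # False (8 vCPUs)
```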

The Azure Machine Learning SDK does not provide support for scaling an AKS cluster. To scale the nodes in the cluster, use the UI for your AKS cluster in the Azure portal. You can only change the node count, not the VM size of the cluster.

For more information on the classes, methods, and parameters used in this example, see the following reference documents:

Using the CLI

az ml computetarget create aks -n myaks

For more information, see the az ml computetarget create aks reference.

Attach an existing AKS cluster

Time estimate: Approximately 5 minutes.

If you already have an AKS cluster in your Azure subscription, and it is lower than version 1.14, you can use it to deploy your image.

Tip

The existing AKS cluster can be in a different Azure region than your Azure Machine Learning workspace.

If you want to secure your AKS cluster using an Azure Virtual Network, you must create the virtual network first. For more information, see Secure experimentation and inference with Azure Virtual Network.

Warning

When attaching an AKS cluster to a workspace, you can define how you will use the cluster by setting the cluster_purpose parameter.

If you do not set the cluster_purpose parameter, or set cluster_purpose = AksCompute.ClusterPurpose.FAST_PROD, then the cluster must have at least 12 virtual CPUs available.

If you set cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST, then the cluster does not need to have 12 virtual CPUs. We recommend at least 2 virtual CPUs for dev/test. However, a cluster that is configured for dev/test is not suitable for production-level traffic and may increase inference times. Dev/test clusters also do not guarantee fault tolerance.

For more information on creating an AKS cluster using the Azure CLI or portal, see the following articles:

The following examples demonstrate how to attach an existing AKS cluster to your workspace:

Using the SDK

from azureml.core.compute import AksCompute, ComputeTarget
# Set the resource group that contains the AKS cluster and the cluster name
resource_group = 'myresourcegroup'
cluster_name = 'myexistingcluster'

# Attach the cluster to your workspace. If the cluster has fewer than 12 virtual CPUs, use the following instead:
# attach_config = AksCompute.attach_configuration(resource_group = resource_group,
#                                         cluster_name = cluster_name,
#                                         cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
attach_config = AksCompute.attach_configuration(resource_group = resource_group,
                                                cluster_name = cluster_name)
aks_target = ComputeTarget.attach(ws, 'myaks', attach_config)

For more information on the classes, methods, and parameters used in this example, see the following reference documents:

Using the CLI

To attach an existing cluster using the CLI, you need to get the resource ID of the existing cluster. To get this value, use the following command. Replace myexistingcluster with the name of your AKS cluster. Replace myresourcegroup with the resource group that contains the cluster:

az aks show -n myexistingcluster -g myresourcegroup --query id

This command returns a value similar to the following text:

/subscriptions/{GUID}/resourcegroups/{myresourcegroup}/providers/Microsoft.ContainerService/managedClusters/{myexistingcluster}

To attach the existing cluster to your workspace, use the following command. Replace aksresourceid with the value returned by the previous command. Replace myresourcegroup with the resource group that contains your workspace. Replace myworkspace with your workspace name.

az ml computetarget attach aks -n myaks -i aksresourceid -g myresourcegroup -w myworkspace

For more information, see the az ml computetarget attach aks reference.

Deploy to AKS

To deploy a model to Azure Kubernetes Service, create a deployment configuration that describes the compute resources needed, such as the number of cores and amount of memory. You also need an inference configuration, which describes the environment needed to host the model and web service. For more information on creating the inference configuration, see How and where to deploy models.

Using the SDK

from azureml.core.compute import AksCompute
from azureml.core.webservice import AksWebservice, Webservice
from azureml.core.model import Model

aks_target = AksCompute(ws, "myaks")
# If deploying to a cluster configured for dev/test, ensure that it was created with enough
# cores and memory to handle this deployment configuration. Note that memory is also used by
# things such as dependencies and AML components.
deployment_config = AksWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1)
service = Model.deploy(ws, "myservice", [model], inference_config, deployment_config, aks_target)
service.wait_for_deployment(show_output = True)
print(service.state)
print(service.get_logs())

For more information on the classes, methods, and parameters used in this example, see the following reference documents:

Using the CLI

To deploy using the CLI, use the following command. Replace myaks with the name of the AKS compute target. Replace mymodel:1 with the name and version of the registered model. Replace myservice with the name to give this service:

az ml model deploy -ct myaks -m mymodel:1 -n myservice -ic inferenceconfig.json -dc deploymentconfig.json

The entries in the deploymentconfig.json document map to the parameters for AksWebservice.deploy_configuration. The following table describes the mapping between the entities in the JSON document and the parameters for the method:

JSON entity Method parameter Description
computeType NA The compute target. For AKS, the value must be aks.
autoScaler NA Contains configuration elements for autoscale. See the autoscaler table.
  autoscaleEnabled autoscale_enabled Whether to enable autoscaling for the web service. If numReplicas = 0, True; otherwise, False.
  minReplicas autoscale_min_replicas The minimum number of containers to use when autoscaling this web service. Default, 1.
  maxReplicas autoscale_max_replicas The maximum number of containers to use when autoscaling this web service. Default, 10.
  refreshPeriodInSeconds autoscale_refresh_seconds How often the autoscaler attempts to scale this web service. Default, 1.
  targetUtilization autoscale_target_utilization The target utilization (in percent out of 100) that the autoscaler should attempt to maintain for this web service. Default, 70.
dataCollection NA Contains configuration elements for data collection.
  storageEnabled collect_model_data Whether to enable model data collection for the web service. Default, False.
authEnabled auth_enabled Whether or not to enable key authentication for the web service. Both tokenAuthEnabled and authEnabled cannot be True. Default, True.
tokenAuthEnabled token_auth_enabled Whether or not to enable token authentication for the web service. Both tokenAuthEnabled and authEnabled cannot be True. Default, False.
containerResourceRequirements NA Container for the CPU and memory entities.
  cpu cpu_cores The number of CPU cores to allocate for this web service. Default, 0.1.
  memoryInGB memory_gb The amount of memory (in GB) to allocate for this web service. Default, 0.5.
appInsightsEnabled enable_app_insights Whether to enable Application Insights logging for the web service. Default, False.
scoringTimeoutMs scoring_timeout_ms A timeout to enforce for scoring calls to the web service. Default, 60000.
maxConcurrentRequestsPerContainer replica_max_concurrent_requests The maximum concurrent requests per node for this web service. Default, 1.
maxQueueWaitMs max_request_wait_time The maximum time a request will stay in the queue (in milliseconds) before a 503 error is returned. Default, 500.
numReplicas num_replicas The number of containers to allocate for this web service. No default value. If this parameter is not set, the autoscaler is enabled by default.
keys NA Contains configuration elements for keys.
  primaryKey primary_key A primary auth key to use for this Webservice
  secondaryKey secondary_key A secondary auth key to use for this Webservice
gpuCores gpu_cores The number of GPU cores to allocate for this Webservice. Default is 1. Only supports whole number values.
livenessProbeRequirements NA Contains configuration elements for liveness probe requirements.
  periodSeconds period_seconds How often (in seconds) to perform the liveness probe. Defaults to 10 seconds. Minimum value is 1.
  initialDelaySeconds initial_delay_seconds Number of seconds after the container has started before liveness probes are initiated. Defaults to 310.
  timeoutSeconds timeout_seconds Number of seconds after which the liveness probe times out. Defaults to 2 seconds. Minimum value is 1.
  successThreshold success_threshold Minimum consecutive successes for the liveness probe to be considered successful after having failed. Defaults to 1. Minimum value is 1.
  failureThreshold failure_threshold When a Pod starts and the liveness probe fails, Kubernetes will try failureThreshold times before giving up. Defaults to 3. Minimum value is 1.
namespace namespace The Kubernetes namespace that the webservice is deployed into. Up to 63 lowercase alphanumeric ('a'-'z', '0'-'9') and hyphen ('-') characters. The first and last characters can't be hyphens.
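
As a quick illustration of the authEnabled/tokenAuthEnabled rule in the table above, the following hypothetical helper (not part of the SDK or CLI) checks that a deployment configuration does not enable both:

```python
import json

def auth_settings_valid(deployment_config_text):
    """Return False when both key and token authentication are enabled.
    Defaults mirror the table above: authEnabled True, tokenAuthEnabled False."""
    config = json.loads(deployment_config_text)
    key_auth = config.get("authEnabled", True)
    token_auth = config.get("tokenAuthEnabled", False)
    return not (key_auth and token_auth)

print(auth_settings_valid('{"computeType": "aks"}'))                           # True
print(auth_settings_valid('{"authEnabled": true, "tokenAuthEnabled": true}'))  # False
```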

The following JSON is an example deployment configuration for use with the CLI:

{
    "computeType": "aks",
    "autoScaler":
    {
        "autoscaleEnabled": true,
        "minReplicas": 1,
        "maxReplicas": 3,
        "refreshPeriodInSeconds": 1,
        "targetUtilization": 70
    },
    "dataCollection":
    {
        "storageEnabled": true
    },
    "authEnabled": true,
    "containerResourceRequirements":
    {
        "cpu": 0.5,
        "memoryInGB": 1.0
    }
}
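
As a further illustration (the values here are hypothetical), a configuration that pins the replica count instead of autoscaling and switches to token authentication might look like the following. Per the table above, setting numReplicas disables the autoscaler, and tokenAuthEnabled requires authEnabled to be false:

```json
{
    "computeType": "aks",
    "autoScaler":
    {
        "autoscaleEnabled": false
    },
    "numReplicas": 2,
    "authEnabled": false,
    "tokenAuthEnabled": true,
    "containerResourceRequirements":
    {
        "cpu": 0.5,
        "memoryInGB": 1.0
    }
}
```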

For more information, see the az ml model deploy reference.

Using VS Code

For information on using VS Code, see deploy to AKS via the VS Code extension.

Important

Deploying through VS Code requires the AKS cluster to be created or attached to your workspace in advance.

Web service authentication

When deploying to Azure Kubernetes Service, key-based authentication is enabled by default. You can also enable token-based authentication. Token-based authentication requires clients to use an Azure Active Directory account to request an authentication token, which is used to make requests to the deployed service.

To disable authentication, set the auth_enabled=False parameter when creating the deployment configuration. The following example disables authentication using the SDK:

deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, auth_enabled=False)

For information on authenticating from a client application, see Consume an Azure Machine Learning model deployed as a web service.

Authentication with keys

If key authentication is enabled, you can use the get_keys method to retrieve a primary and secondary authentication key:

primary, secondary = service.get_keys()
print(primary)

Important

If you need to regenerate a key, use the service.regen_key method.

Authentication with tokens

To enable token authentication, set the token_auth_enabled=True parameter when you are creating or updating a deployment. The following example enables token authentication using the SDK:

deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, token_auth_enabled=True)

If token authentication is enabled, you can use the get_token method to retrieve a JWT token and that token's expiration time:

token, refresh_by = service.get_token()
print(token)

Important

You will need to request a new token after the token's refresh_by time.

Microsoft strongly recommends that you create your Azure Machine Learning workspace in the same region as your Azure Kubernetes Service cluster. To authenticate with a token, the web service makes a call to the region in which your Azure Machine Learning workspace is created. If your workspace's region is unavailable, you will not be able to fetch a token for your web service, even if your cluster is in a different region than your workspace. The result is that token-based authentication is unavailable until your workspace's region is available again. In addition, the greater the distance between your cluster's region and your workspace's region, the longer it takes to fetch a token.
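
For example, a client might decide when to re-request a token with a small helper like the following. This is a hypothetical sketch that assumes the refresh_by value returned by get_token() is a timezone-aware UTC datetime; adjust if your SDK version returns a different type:

```python
from datetime import datetime, timedelta, timezone

def token_needs_refresh(refresh_by, margin_seconds=60, now=None):
    """Return True when a new token should be requested.
    refresh_by: the second value returned by service.get_token().
    A small margin avoids sending a request with a token that is
    about to expire."""
    now = now or datetime.now(timezone.utc)
    return now >= refresh_by - timedelta(seconds=margin_seconds)

# A token whose refresh time passed yesterday clearly needs a refresh.
past = datetime.now(timezone.utc) - timedelta(days=1)
print(token_needs_refresh(past))   # True
```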

Update the web service

To update a web service, use the update method. You can update the web service to use a new model, a new entry script, or new dependencies that can be specified in an inference configuration. For more information, see the documentation for Webservice.update.

Important

When you create a new version of a model, you must manually update each service that you want to use it.

Using the SDK

The following code shows how to use the SDK to update the model, environment, and entry script for a web service:

from azureml.core import Environment
from azureml.core.webservice import Webservice
from azureml.core.model import Model, InferenceConfig

# Register new model.
new_model = Model.register(model_path="outputs/sklearn_mnist_model.pkl",
                           model_name="sklearn_mnist",
                           tags={"key": "0.1"},
                           description="test",
                           workspace=ws)

# Use version 3 of the environment.
deploy_env = Environment.get(workspace=ws,name="myenv",version="3")
inference_config = InferenceConfig(entry_script="score.py",
                                   environment=deploy_env)

service_name = 'myservice'
# Retrieve existing service.
service = Webservice(name=service_name, workspace=ws)

# Update to new model(s).
service.update(models=[new_model], inference_config=inference_config)
print(service.state)
print(service.get_logs())

Using the CLI

You can also update a web service by using the ML CLI. The following example demonstrates registering a new model and then updating a web service to use the new model:

az ml model register -n sklearn_mnist --asset-path outputs/sklearn_mnist_model.pkl --experiment-name myexperiment --output-metadata-file modelinfo.json
az ml service update -n myservice --model-metadata-file modelinfo.json

Tip

In this example, a JSON document is used to pass the model information from the registration command into the update command.

To update the service to use a new entry script or environment, create an inference configuration file and specify it with the --ic parameter.
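
As an illustration only (field names can vary by CLI version, so check the az ml reference for the current schema), an inference configuration file referencing an entry script and a conda environment file might look like:

```json
{
    "entryScript": "score.py",
    "runtime": "python",
    "condaFile": "myenv.yml"
}
```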

For more information, see the az ml service update documentation.

Next steps