Deploy machine learning models to Azure

Learn how to deploy your machine learning or deep learning model as a web service in the Azure cloud.

Tip

Managed online endpoints (preview) provide a way to deploy your trained model without your having to create and manage the underlying infrastructure. For more information, see Deploy and score a machine learning model with a managed online endpoint (preview).

The workflow is similar no matter where you deploy your model:

  1. Register the model
  2. Prepare an entry script
  3. Prepare an inference configuration
  4. Deploy the model locally to ensure everything works
  5. Choose a compute target
  6. Re-deploy the model to the cloud
  7. Test the resulting web service

For more information on the concepts involved in the machine learning deployment workflow, see Manage, deploy, and monitor models with Azure Machine Learning.

Note

Azure Machine Learning Endpoints (preview) provide an improved, simpler deployment experience. Endpoints support both real-time and batch inference scenarios. Endpoints provide a unified interface to invoke and manage model deployments across compute types. See What are Azure Machine Learning endpoints (preview)?.

Prerequisites

Connect to your workspace

To see the workspaces you have access to, run:

az login
az account set -s <my subscription>
az ml workspace list --resource-group=<my resource group>

Register your model

A typical situation for a deployed machine learning service is that you need the following components:

  • resources representing the specific model that you want deployed (for example, a PyTorch model file)
  • code that runs in the service and executes the model on a given input

Azure Machine Learning lets you separate the deployment into these two components, so that you can keep the same code and update only the model. The mechanism by which you upload a model separately from your code is called "registering the model".

When you register a model, we upload the model to the cloud (in your workspace's default storage account) and then mount it to the same compute where your webservice is running.

The following examples demonstrate how to register a model.

Important

You should use only models that you create or obtain from a trusted source. You should treat serialized models as code, because security vulnerabilities have been discovered in a number of popular formats. Also, models might be intentionally trained with malicious intent to provide biased or inaccurate output.

Register a model from a local file

!wget https://aka.ms/bidaf-9-model -O model.onnx
!az ml model register -n bidaf_onnx -p ./model.onnx

Set -p to the path of a folder or a file that you want to register.

For more information on az ml model register, consult the reference documentation.

Register a model from an Azure ML training run

az ml model register -n bidaf_onnx --asset-path outputs/model.onnx --experiment-name myexperiment --run-id myrunid --tag area=qna

Tip

If you get an error message stating that the ml extension isn't installed, use the following command to install it:

az extension add -n azure-cli-ml

The --asset-path parameter refers to the cloud location of the model. In this example, the path of a single file is used. To include multiple files in the model registration, set --asset-path to the path of a folder that contains the files.

For more information on az ml model register, consult the reference documentation.

Define a dummy entry script

The entry script receives data submitted to a deployed web service and passes it to the model. It then returns the model's response to the client. The script is specific to your model. The entry script must understand the data that the model expects and returns.

The two things you need to accomplish in your entry script are:

  1. Loading your model (using a function called init())
  2. Running your model on input data (using a function called run())

For your initial deployment, use a dummy entry script that prints the data it receives.

import json


def init():
    print("This is init")


def run(data):
    test = json.loads(data)
    print(f"received data {test}")
    return f"test is {test}"

Save this file as echo_score.py inside of a directory called source_dir.
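Before deploying, the echo script can be exercised directly in Python. The sketch below assumes the service calls init() once at startup and then passes each raw request body to run() as a JSON string; the functions are copied from echo_score.py so that the block is self-contained.

```python
import json


# Copied from echo_score.py so this block runs on its own.
def init():
    print("This is init")


def run(data):
    test = json.loads(data)
    print(f"received data {test}")
    return f"test is {test}"


# Assumption: the service calls init() once, then run() for every request.
init()
body = json.dumps({"this": "is a test"})  # what a client would POST
result = run(body)
print(result)
```

If the script raises here, it will also fail inside the container, so this is a cheap first check.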

So, for example, if a user calls your model with:

curl -X POST -d '{"this":"is a test"}' -H "Content-Type: application/json" http://localhost:32267/score

The following value is returned:

"test is {'this': 'is a test'}"

Define an inference configuration

An inference configuration describes the Docker container and files to use when initializing your web service. All of the files within your source directory, including subdirectories, will be zipped up and uploaded to the cloud when you deploy your web service.

The inference configuration below specifies that the machine learning deployment will use the file echo_score.py in the ./source_dir directory to process incoming requests and that it will use the Docker image with the Python packages specified in the project_environment environment.

You can use any Azure Machine Learning curated environment as the base Docker image when creating your project environment. We will install the required dependencies on top and store the resulting Docker image into the repository that is associated with your workspace.

A minimal inference configuration can be written as:

{
    "entryScript": "echo_score.py",
    "sourceDirectory": "./source_dir",
    "environment": {
        "docker": {
            "arguments": [],
            "baseDockerfile": null,
            "baseImage": "mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04",
            "enabled": false,
            "sharedVolumes": true,
            "shmSize": null
        },
        "environmentVariables": {
            "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
        },
        "name": "my-deploy-env",
        "python": {
            "baseCondaEnvironment": null,
            "condaDependencies": {
                "channels": [],
                "dependencies": [
                    "python=3.6.2",
                    {
                        "pip": [
                            "azureml-defaults"
                        ]
                    }
                ],
                "name": "project_environment"
            },
            "condaDependenciesFile": null,
            "interpreterPath": "python",
            "userManagedDependencies": false
        },
        "version": "1"
    }
}

Save this file with the name dummyinferenceconfig.json.

See this article for a more thorough discussion of inference configurations.
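A quick sanity check of the configuration file can catch a misspelled entry script before any Docker image is built. The following is a minimal sketch, not part of the Azure tooling: check_inference_config is a hypothetical helper, and it assumes that sourceDirectory is resolved relative to the folder that holds the configuration file.

```python
import json
import os
import tempfile


def check_inference_config(config_path):
    """Return a list of problems found in an inference configuration file."""
    with open(config_path) as f:
        config = json.load(f)

    problems = []
    for key in ("entryScript", "sourceDirectory", "environment"):
        if key not in config:
            problems.append(f"missing key: {key}")
    if problems:
        return problems

    # Assumption of this sketch: sourceDirectory is relative to the config file.
    base = os.path.dirname(os.path.abspath(config_path))
    entry = os.path.join(base, config["sourceDirectory"], config["entryScript"])
    if not os.path.isfile(entry):
        problems.append(f"entry script not found: {entry}")
    return problems


# Demo in a throwaway folder: source_dir exists, but echo_score.py does not.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "source_dir"))
    demo_path = os.path.join(tmp, "dummyinferenceconfig.json")
    with open(demo_path, "w") as f:
        json.dump(
            {
                "entryScript": "echo_score.py",
                "sourceDirectory": "./source_dir",
                "environment": {},
            },
            f,
        )
    demo_problems = check_inference_config(demo_path)
    print(demo_problems)  # one entry reporting the missing script
```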

Define a deployment configuration

A deployment configuration specifies the amount of memory and the number of cores that your webservice needs in order to run, as well as configuration details of the underlying webservice. For example, a deployment configuration lets you specify that your service needs 2 gigabytes of memory, 2 CPU cores, 1 GPU core, and that you want to enable autoscaling.

The options available for a deployment configuration differ depending on the compute target you choose. In a local deployment, all you can specify is which port your webservice will be served on.

The entries in the deploymentconfig.json document map to the parameters for LocalWebservice.deploy_configuration. The following table describes the mapping between the entities in the JSON document and the parameters for the method:

| JSON entity | Method parameter | Description |
| --- | --- | --- |
| computeType | NA | The compute target. For local targets, the value must be local. |
| port | port | The local port on which to expose the service's HTTP endpoint. |

This JSON is an example deployment configuration for use with the CLI:

{
    "computeType": "local",
    "port": 32267
}

Save this JSON as a file called deploymentconfig.json.

For more information, see this reference.
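Because the port is the only setting for a local deployment, it can be worth confirming that the port is available before deploying. The sketch below uses only the Python standard library; port_is_free is a hypothetical helper, and a successful bind is taken as a sign that the port is available on the loopback interface.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if this process can bind host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use"; privileged ports also land here.
            return False
    return True


config = {"computeType": "local", "port": 32267}
if port_is_free(config["port"]):
    print(f"port {config['port']} looks available")
else:
    print(f"port {config['port']} is busy; choose another one before deploying")
```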

Deploy your machine learning model

You are now ready to deploy your model.

Replace bidaf_onnx:1 with the name of your model and its version number.

!az ml model deploy -n myservice -m bidaf_onnx:1 --overwrite --ic dummyinferenceconfig.json --dc deploymentconfig.json
!az ml service get-logs -n myservice
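When the deployment is scripted rather than typed, the model name and version can be passed as parameters. The sketch below only assembles the same command line shown above; build_deploy_command is a hypothetical helper name, and it uses no flags beyond those that already appear in this article.

```python
import shlex


def build_deploy_command(service, model, version, inference_config, deployment_config):
    """Assemble the `az ml model deploy` call as an argument list."""
    return [
        "az", "ml", "model", "deploy",
        "-n", service,
        "-m", f"{model}:{version}",  # the CLI expects name:version
        "--overwrite",
        "--ic", inference_config,
        "--dc", deployment_config,
    ]


cmd = build_deploy_command(
    "myservice", "bidaf_onnx", 1, "dummyinferenceconfig.json", "deploymentconfig.json"
)
print(shlex.join(cmd))
```

To run the command from Python, pass the list to subprocess.run(cmd, check=True); keeping the arguments in a list avoids shell quoting problems.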

Call into your model

Let's check that your echo model deployed successfully. You should be able to do a simple liveness request, as well as a scoring request:

!curl -v http://localhost:32267
!curl -v -X POST -H "content-type:application/json" -d '{"query": "What color is the fox", "context": "The quick brown fox jumped over the lazy dog."}' http://localhost:32267/score
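The same scoring request can be sent from Python without extra packages. This is a minimal sketch using urllib from the standard library; build_score_request and score are hypothetical helper names, and the optional key argument only matters for key-authenticated remote deployments.

```python
import json
import urllib.request


def build_score_request(scoring_uri, payload, key=None):
    """Create a POST request that carries `payload` as a JSON body."""
    headers = {"Content-Type": "application/json"}
    if key is not None:
        # Key-authenticated services expect the key as a bearer token.
        headers["Authorization"] = f"Bearer {key}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


def score(scoring_uri, payload, key=None, timeout=10):
    """Send the request and return the response body as text."""
    request = build_score_request(scoring_uri, payload, key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


payload = {
    "query": "What color is the fox",
    "context": "The quick brown fox jumped over the lazy dog.",
}
request = build_score_request("http://localhost:32267/score", payload)
print(request.get_method(), request.full_url)
```

With the local service running, score("http://localhost:32267/score", payload) returns whatever the entry script's run() function produced.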

Define an entry script

Now it's time to actually load your model. First, modify your entry script:

import json
import numpy as np
import os
import onnxruntime
from nltk import word_tokenize
import nltk


def init():
    nltk.download("punkt")
    global sess
    sess = onnxruntime.InferenceSession(
        os.path.join(os.getenv("AZUREML_MODEL_DIR"), "model.onnx")
    )


def run(request):
    print(request)
    text = json.loads(request)
    qw, qc = preprocess(text["query"])
    cw, cc = preprocess(text["context"])

    # Run inference
    test = sess.run(
        None,
        {"query_word": qw, "query_char": qc, "context_word": cw, "context_char": cc},
    )
    # ndarray.item() replaces np.asscalar, which newer NumPy releases no longer provide
    start = test[0].item()
    end = test[1].item()
    ans = [w for w in cw[start : end + 1].reshape(-1)]
    print(ans)
    return ans


def preprocess(word):
    tokens = word_tokenize(word)

    # split into lower-case word tokens, in numpy array with shape of (seq, 1)
    words = np.asarray([w.lower() for w in tokens]).reshape(-1, 1)

    # split words into chars, in numpy array with shape of (seq, 1, 1, 16)
    chars = [[c for c in t][:16] for t in tokens]
    chars = [cs + [""] * (16 - len(cs)) for cs in chars]
    chars = np.asarray(chars).reshape(-1, 1, 1, 16)
    return words, chars

Save this file as score.py inside of source_dir.
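The character handling in preprocess() and the answer slicing in run() are easier to follow without NumPy. The sketch below mirrors both steps with plain lists; pad_chars and answer_span are hypothetical names, and the indices passed to answer_span are made up for illustration rather than real model output.

```python
def pad_chars(tokens, width=16):
    """Split each token into characters, truncated or padded with "" to `width`."""
    rows = []
    for token in tokens:
        chars = list(token)[:width]
        rows.append(chars + [""] * (width - len(chars)))
    return rows


def answer_span(context_words, start, end):
    """Return the context words from index `start` through `end`, inclusive."""
    return context_words[start : end + 1]


tokens = ["the", "quick", "brown", "fox"]
padded = pad_chars(tokens)
print(len(padded), len(padded[0]))  # one row per token, 16 slots per row

# Illustrative indices only: pretend the model pointed at word 2.
print(answer_span(tokens, 2, 2))
```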

Notice the use of the AZUREML_MODEL_DIR environment variable to locate your registered model. Now that you've added some pip packages, you also need to update your inference configuration to include them:

{
    "entryScript": "score.py",
    "sourceDirectory": "./source_dir",
    "environment": {
        "docker": {
            "arguments": [],
            "baseDockerfile": null,
            "baseImage": "mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04",
            "enabled": false,
            "sharedVolumes": true,
            "shmSize": null
        },
        "environmentVariables": {
            "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
        },
        "name": "my-deploy-env",
        "python": {
            "baseCondaEnvironment": null,
            "condaDependencies": {
                "channels": [],
                "dependencies": [
                    "python=3.6.2",
                    {
                        "pip": [
                            "azureml-defaults",
                            "nltk",
                            "numpy",
                            "onnxruntime"
                        ]
                    }
                ],
                "name": "project_environment"
            },
            "condaDependenciesFile": null,
            "interpreterPath": "python",
            "userManagedDependencies": false
        },
        "version": "2"
    }
}

Save this file as inferenceconfig.json.
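A forgotten pip package usually shows up only when the container starts, so a rough cross-check between the entry script's imports and the pip list can save a deployment cycle. The sketch below is a heuristic, not part of the Azure tooling: missing_packages is a hypothetical helper, and it assumes that each import name matches its pip package name and that the pip entries carry no version specifiers.

```python
import ast
import sys

# sys.stdlib_module_names exists on Python 3.10+. The fallback is a small,
# hand-picked assumption that covers the standard modules used in this article.
STDLIB = getattr(sys, "stdlib_module_names", None) or frozenset({"json", "os", "sys"})


def missing_packages(script_source, pip_packages):
    """Return imported top-level modules that are neither stdlib nor in pip_packages."""
    imported = set()
    for node in ast.walk(ast.parse(script_source)):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            imported.add(node.module.split(".")[0])
    return sorted(imported - set(STDLIB) - set(pip_packages))


# The import lines of score.py, inlined so the block is self-contained.
source = (
    "import json\n"
    "import numpy as np\n"
    "import os\n"
    "import onnxruntime\n"
    "from nltk import word_tokenize\n"
)
print(missing_packages(source, ["azureml-defaults"]))
print(missing_packages(source, ["azureml-defaults", "nltk", "numpy", "onnxruntime"]))
```

The first call lists the packages that the dummy configuration lacks; the second call returns an empty list for the updated configuration.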

Deploy again and call your service

Deploy your service again:

Replace bidaf_onnx:1 with the name of your model and its version number.

!az ml model deploy -n myservice -m bidaf_onnx:1 --overwrite --ic inferenceconfig.json --dc deploymentconfig.json
!az ml service get-logs -n myservice

Then ensure you can send a post request to the service:

!curl -v -X POST -H "content-type:application/json" -d '{"query": "What color is the fox", "context": "The quick brown fox jumped over the lazy dog."}' http://localhost:32267/score

Choose a compute target

Refer to the diagram below when choosing a compute target.

[Diagram: How to choose a compute target]

The compute target you use to host your model will affect the cost and availability of your deployed endpoint. Use this table to choose an appropriate compute target.

| Compute target | Used for | GPU support | FPGA support | Description |
| --- | --- | --- | --- | --- |
| Local web service | Testing/debugging | | | Use for limited testing and troubleshooting. Hardware acceleration depends on use of libraries in the local system. |
| Azure Kubernetes Service (AKS) | Real-time inference | Yes (web service deployment) | Yes | Use for high-scale production deployments. Provides fast response time and autoscaling of the deployed service. Cluster autoscaling isn't supported through the Azure Machine Learning SDK. To change the nodes in the AKS cluster, use the UI for your AKS cluster in the Azure portal. Supported in the designer. |
| Azure Container Instances | Real-time inference | | | Use for low-scale CPU-based workloads that require less than 48 GB of RAM. Doesn't require you to manage a cluster. Supported in the designer. |
| Azure Machine Learning compute clusters | Batch inference | Yes (machine learning pipeline) | | Run batch scoring on serverless compute. Supports normal and low-priority VMs. No support for real-time inference. |

Note

Although compute targets like local and Azure Machine Learning compute clusters support GPU for training and experimentation, using GPU for inference when deployed as a web service is supported only on AKS.

Using a GPU for inference when scoring with a machine learning pipeline is supported only on Azure Machine Learning compute.

When choosing a cluster SKU, first scale up and then scale out. Start with a machine that has 150% of the RAM your model requires, profile the result, and find a machine that has the performance you need. Once you've found it, increase the number of machines to fit your need for concurrent inference.
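The 150% rule is simple arithmetic, shown here as a small sketch. The helper names and the SKU table are hypothetical and exist only for illustration; real VM sizes and their memory come from the Azure documentation for your region.

```python
def starting_ram_gb(model_ram_gb, headroom=1.5):
    """Apply the 150% rule: RAM to look for when picking the first machine."""
    return model_ram_gb * headroom


def smallest_fitting_sku(model_ram_gb, skus):
    """Pick the SKU with the least RAM that still meets the 150% starting point."""
    target = starting_ram_gb(model_ram_gb)
    candidates = [(ram, name) for name, ram in skus.items() if ram >= target]
    return min(candidates)[1] if candidates else None


# Hypothetical SKU names and RAM sizes in GB, for illustration only.
EXAMPLE_SKUS = {"small": 4, "medium": 8, "large": 16}

print(starting_ram_gb(4))                     # 6.0
print(smallest_fitting_sku(4, EXAMPLE_SKUS))  # medium
```

This covers only the scale-up starting point; the number of machines still has to come from profiling under concurrent load.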

Note

  • Container instances are suitable only for small models less than 1 GB in size.
  • Use single-node AKS clusters for dev/test of larger models.

Re-deploy to cloud

Once you've confirmed your service works locally and chosen a remote compute target, you are ready to deploy to the cloud.

The options available for a deployment configuration differ depending on the compute target you choose. Change your deployment configuration to correspond to the compute target you've chosen, in this case Azure Container Instances:

{
    "computeType": "aci",
    "containerResourceRequirements":
    {
        "cpu": 0.5,
        "memoryInGB": 1.0
    },
    "authEnabled": true,
    "sslEnabled": false,
    "appInsightsEnabled": false
}

Save this file as re-deploymentconfig.json.

For more information, see this reference.
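If the resource numbers change between deployments, the configuration can be generated instead of edited by hand. The sketch below builds the same JSON structure shown above; make_aci_config is a hypothetical helper, and its 48 GB check simply restates the limit from the compute target table.

```python
import json


def make_aci_config(cpu, memory_gb, auth_enabled=True):
    """Build an ACI deployment configuration with the keys used in this article."""
    if cpu <= 0 or memory_gb <= 0:
        raise ValueError("cpu and memory_gb must be positive")
    if memory_gb >= 48:
        # The compute target table lists ACI for workloads under 48 GB of RAM.
        raise ValueError("ACI is intended for workloads that need less than 48 GB of RAM")
    return {
        "computeType": "aci",
        "containerResourceRequirements": {"cpu": cpu, "memoryInGB": memory_gb},
        "authEnabled": auth_enabled,
        "sslEnabled": False,
        "appInsightsEnabled": False,
    }


aci_config = make_aci_config(0.5, 1.0)
print(json.dumps(aci_config, indent=4))
```

Writing the result with json.dump to re-deploymentconfig.json produces a file equivalent to the one above.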

Deploy your service again:

Replace bidaf_onnx:1 with the name of your model and its version number.

!az ml model deploy -n myaciservice -m bidaf_onnx:1 --overwrite --ic inferenceconfig.json --dc re-deploymentconfig.json
!az ml service get-logs -n myaciservice

Call your remote webservice

When you deploy remotely, you may have key authentication enabled. The example below shows how to get your service key with Python in order to make an inference request.

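import requests
import json
from azureml.core import Webservice, Workspace

# Load the workspace (assumes a config.json for your workspace is available locally)
ws = Workspace.from_config()

# Look up the remote service that was deployed to Azure Container Instances
service = Webservice(workspace=ws, name="myaciservice")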
scoring_uri = service.scoring_uri

# If the service is authenticated, set the key or token
key, _ = service.get_keys()

# Set the appropriate headers
headers = {"Content-Type": "application/json"}
headers["Authorization"] = f"Bearer {key}"

# Make the request and display the response and logs
data = {
    "query": "What color is the fox",
    "context": "The quick brown fox jumped over the lazy dog.",
}
data = json.dumps(data)
resp = requests.post(scoring_uri, data=data, headers=headers)
print(resp.text)
print(service.get_logs())

See the article on client applications to consume web services for more example clients in other languages.

Understanding service state

During model deployment, you may see the service state change while it fully deploys.

The following table describes the different service states:

| Webservice state | Description | Final state? |
| --- | --- | --- |
| Transitioning | The service is in the process of deployment. | No |
| Unhealthy | The service has deployed but is currently unreachable. | No |
| Unschedulable | The service cannot be deployed at this time due to lack of resources. | No |
| Failed | The service has failed to deploy due to an error or crash. | Yes |
| Healthy | The service is healthy and the endpoint is available. | Yes |
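The distinction between final and non-final states matters when a script has to wait for a deployment. The following is a minimal polling sketch under stated assumptions: wait_for_final_state is a hypothetical helper, and get_state stands for any callable that returns the current state name from the table, replaced here by a canned sequence.

```python
import time

# States marked as final in the table above.
FINAL_STATES = {"Failed", "Healthy"}


def wait_for_final_state(get_state, timeout=300, interval=5, sleep=time.sleep):
    """Poll get_state() until it returns a final state or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in FINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"service still {state} after {timeout} seconds")
        sleep(interval)


# Stand-in for a real status lookup: replays a fixed sequence of states.
states = iter(["Transitioning", "Transitioning", "Healthy"])
final_state = wait_for_final_state(lambda: next(states), interval=0)
print(final_state)
```

A result of Failed is also final, so the caller should inspect the returned state (and the service logs) rather than assume success.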

Tip

When deploying, Docker images for compute targets are built and loaded from Azure Container Registry (ACR). By default, Azure Machine Learning creates an ACR that uses the basic service tier. Changing the ACR for your workspace to standard or premium tier may reduce the time it takes to build and deploy images to your compute targets. For more information, see Azure Container Registry service tiers.

Note

If you are deploying a model to Azure Kubernetes Service (AKS), we advise you enable Azure Monitor for that cluster. This will help you understand overall cluster health and resource usage. You might also find the following resources useful:

If you deploy a model to an unhealthy or overloaded cluster, expect to run into issues. If you need help troubleshooting AKS cluster problems, contact AKS Support.

Delete resources

# Get the current model id
import os

stream = os.popen(
    'az ml model list --model-name=bidaf_onnx --latest --query "[0].id" -o tsv'
)
# Strip the trailing newline from the CLI output
MODEL_ID = stream.read().strip()
MODEL_ID
!az ml service delete -n myservice
!az ml service delete -n myaciservice
!az ml model delete --model-id=$MODEL_ID

To delete a deployed webservice, use az ml service delete -n <name of webservice>.

To delete a registered model from your workspace, use az ml model delete --model-id <model id>.

Read more about deleting a webservice and deleting a model.

Next steps