High-performance serving with Triton Inference Server (Preview)

Learn how to use NVIDIA Triton Inference Server to improve the performance of the web service used for model inference.

One way to deploy a model for inference is as a web service, for example a deployment to Azure Kubernetes Service or Azure Container Instances. By default, Azure Machine Learning uses a single-threaded, general-purpose web framework for web service deployments.

Triton is a framework that is optimized for inference. It provides better utilization of GPUs and more cost-effective inference. On the server side, it batches incoming requests and submits those batches for inference. Batching improves GPU utilization and is a key part of Triton's performance.


Using Triton for deployment from Azure Machine Learning is currently in preview. Preview functionality may not be covered by customer support. For more information, see the Supplemental terms of use for Microsoft Azure previews.


The code snippets in this document are for illustrative purposes and may not show a complete solution. For working example code, see the end-to-end samples of Triton in Azure Machine Learning.


NVIDIA Triton Inference Server is open-source, third-party software that is integrated into Azure Machine Learning.


Architectural overview

Before attempting to use Triton for your own model, it's important to understand how it works with Azure Machine Learning and how it compares to a default deployment.

Default deployment without Triton

  • Multiple Gunicorn workers are started to concurrently handle incoming requests.
  • These workers handle pre-processing, calling the model, and post-processing.
  • Clients use the Azure ML scoring URI. For example, https://myservice.azureml.net/score.

(Diagram: default deployment architecture without Triton)

Deploying with Triton directly

  • Requests go directly to the Triton server.
  • Triton processes requests in batches to maximize GPU utilization.
  • The client uses the Triton URI to make requests. For example, https://myservice.azureml.net/v2/models/${MODEL_NAME}/versions/${MODEL_VERSION}/infer.
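The Triton scoring URI follows the pattern shown in the bullet above. As a sketch, it can be assembled from the model name and version; the hostname and model details below are placeholders:

```python
# Build the Triton (v2 inference protocol) scoring URI for a deployed model.
# "myservice.azureml.net" is a placeholder hostname; substitute the scoring
# host of your own service.
def triton_infer_uri(host: str, model_name: str, model_version: str) -> str:
    return f"https://{host}/v2/models/{model_name}/versions/{model_version}/infer"

uri = triton_infer_uri("myservice.azureml.net", "densenet_onnx", "1")
print(uri)
```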

(Diagram: deployment with Triton only, with no Python middleware)

Deploying Triton without Python pre- and post-processing

First, follow the steps below to verify that the Triton Inference Server can serve your model.

(Optional) Define a model config file

The model configuration file tells Triton how many inputs to expect and the dimensions of those inputs. For more information on creating the configuration file, see Model configuration in the NVIDIA documentation.


We use the --strict-model-config=false option when starting the Triton Inference Server, which means you do not need to provide a config.pbtxt file for ONNX or TensorFlow models.

For more information on this option, see Generated model configuration in the NVIDIA documentation.
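For reference, a minimal config.pbtxt for a hypothetical ONNX image-classification model might look like the following. The model name, platform, tensor names, and dimensions here are illustrative and must match your actual model:

```
name: "densenet_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "data_0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "fc6_1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```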

Use the correct directory structure

When registering a model with Azure Machine Learning, you can register either individual files or a directory structure. To use Triton, the model registration must be for a directory structure that contains a directory named triton. The general structure of this directory is:

    - triton
        - model_1
            - model_version
                - model_file
            - config_file
        - model_2


This directory structure is a Triton Model Repository and is required for your model(s) to work with Triton. For more information, see Triton Model Repositories in the NVIDIA documentation.
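The layout above can be created with a short script. This is a minimal sketch: the model name, version folder, and file names are placeholders, and in Triton each model version is a numbered folder:

```python
from pathlib import Path

# Build a minimal Triton Model Repository under a parent folder named
# "models". The model name and file names below are placeholders.
repo = Path("models") / "triton"
model_dir = repo / "densenet_onnx"
version_dir = model_dir / "1"          # each version is a numbered folder
version_dir.mkdir(parents=True, exist_ok=True)

(version_dir / "model.onnx").touch()   # the serialized model file
(model_dir / "config.pbtxt").touch()   # optional model configuration

print(sorted(p.as_posix() for p in repo.rglob("*")))
```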

Register your Triton model

az ml model register -n my_triton_model -p models --model-framework=Multi

For more information on az ml model register, consult the reference documentation.

When registering the model in Azure Machine Learning, the value of the --model-path (-p) parameter must be the name of the parent folder of the Triton Model Repository. In the example above, --model-path is 'models'.

The value of the --name (-n) parameter, my_triton_model in the example, is the model name known to the Azure Machine Learning workspace.

Deploy your model

If you have a GPU-enabled Azure Kubernetes Service cluster called "aks-gpu" created through Azure Machine Learning, you can use the following command to deploy your model.

az ml model deploy -n triton-webservice -m triton_model:1 --dc deploymentconfig.json --compute-target aks-gpu
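The deploymentconfig.json referenced above describes the deployment target. As a sketch only, a minimal AKS configuration might look like the following; the keys shown are from the Azure ML CLI v1 deployment-config schema, and the resource values are placeholders you should adjust for your cluster and model (consult the deployment documentation for GPU-specific settings):

```json
{
  "computeType": "aks",
  "containerResourceRequirements": {
    "cpu": 1,
    "memoryInGB": 4
  }
}
```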

See this documentation for more details on deploying models.


Azure Machine Learning Endpoints (preview) provide an improved, simpler deployment experience. Endpoints support both real-time and batch inference scenarios. Endpoints provide a unified interface to invoke and manage model deployments across compute types. See What are Azure Machine Learning endpoints (preview)?.

Call into your deployed model

First, get your scoring URI and bearer token.

az ml service show --name=triton-webservice

Then, check that the service is running:

curl -v $scoring_uri/v2/health/ready -H "Authorization: Bearer $service_key"

This command returns information similar to the following. Note the 200 OK; this status means the web server is running.

*   Trying
* Connected to localhost port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.71.1
> Accept: */*
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK

Once you've performed a health check, you can create a client to send data to Triton for inference. For more information on creating a client, see the client examples in the NVIDIA documentation. There are also Python samples at the Triton GitHub.
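As a sketch of what such a client sends, the v2 inference protocol uses a JSON request body that can be assembled with plain Python. The input name, shape, and model identifiers below are placeholders and must match your model's configuration:

```python
import json

def build_v2_request(input_name, shape, datatype, data):
    """Assemble a v2 inference protocol request body."""
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": shape,
                "datatype": datatype,
                "data": data,
            }
        ]
    }

# Placeholder values; a real request would send preprocessed image data.
body = build_v2_request("data_0", [1, 3, 224, 224], "FP32",
                        [0.0] * (3 * 224 * 224))
payload = json.dumps(body)

# To send it (requires the `requests` package and a running service):
#   import requests
#   headers = {"Authorization": f"Bearer {service_key}",
#              "Content-Type": "application/json"}
#   r = requests.post(f"{scoring_uri}/v2/models/densenet_onnx/versions/1/infer",
#                     data=payload, headers=headers)
#   print(r.json())
```

The NVIDIA tritonclient Python package wraps this protocol in a higher-level API and is generally preferable to hand-building requests; see the client examples linked above.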

Clean up resources

If you plan to keep using the Azure Machine Learning workspace but want to remove the deployed service, delete it with:

az ml service delete -n triton-webservice


Next steps