What are Azure Machine Learning endpoints (preview)?

Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Note

This article uses the latest version of the CLI v2, which is in public preview. To install and set up the latest version, see the Install and set up CLI (v2) doc.

Use Azure Machine Learning endpoints (preview) to streamline model deployments for both real-time and batch inferencing. Endpoints provide a unified interface to invoke and manage model deployments across compute types.

In this article, you learn about:

  • Endpoints
  • Deployments
  • Managed online endpoints
  • Kubernetes online endpoints
  • Batch inference endpoints

What are endpoints and deployments (preview)?

After you train a machine learning model, you need to deploy the model so that others can use it to do inferencing. In Azure Machine Learning, you can use endpoints (preview) and deployments (preview) to do so.

An endpoint is an HTTPS endpoint that clients can call to receive the inferencing (scoring) output of a trained model. It provides:

  • Authentication using key- and token-based auth
  • SSL termination
  • A stable scoring URI (endpoint-name.region.inference.ml.azure.com)
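
For illustration, here's a minimal sketch of calling a key-authenticated scoring URI. The endpoint name, region, key variable, and request payload are placeholders, and the payload format depends on your scoring script.

```bash
# Illustrative: invoke a key-authenticated online endpoint over HTTPS.
# Replace the endpoint name, region, key, and payload with your own values.
curl --request POST "https://my-endpoint.eastus.inference.ml.azure.com/score" \
  --header "Authorization: Bearer $ENDPOINT_KEY" \
  --header "Content-Type: application/json" \
  --data '{"data": [[1.0, 2.0, 3.0, 4.0]]}'
```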

A deployment is a set of resources required for hosting the model that does the actual inferencing.

A single endpoint can contain multiple deployments. Endpoints and deployments are independent Azure Resource Manager resources that appear in the Azure portal.

Azure Machine Learning uses the concept of endpoints and deployments to implement different types of endpoints: online endpoints and batch endpoints.

Multiple developer interfaces

Create and manage batch and online endpoints with multiple developer tools:

  • The Azure CLI
  • Azure Resource Manager/REST API
  • Azure Machine Learning studio web portal
  • Azure portal (IT/Admin)
  • Support for CI/CD MLOps pipelines using the Azure CLI and REST/ARM interfaces

What are online endpoints (preview)?

Online endpoints (preview) are endpoints that are used for online (real-time) inferencing. Compared to batch endpoints, online endpoints contain deployments that are ready to receive data from clients and can send responses back in real time.

The following diagram shows an online endpoint that has two deployments, 'blue' and 'green'. The blue deployment uses VMs with a CPU SKU, and runs v1 of a model. The green deployment uses VMs with a GPU SKU, and uses v2 of the model. The endpoint is configured to route 90% of incoming traffic to the blue deployment, while green receives the remaining 10%.

Diagram showing an endpoint splitting traffic to two deployments

Online deployment requirements

To create an online deployment, you need to specify the following elements (see the sketch after this list):

  • Model files (or specify a registered model in your workspace)
  • Scoring script - code needed to do scoring/inferencing
  • Environment - a Docker image with Conda dependencies, or a dockerfile
  • Instance type & scale settings
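
As a sketch of how these elements fit together, the following illustrative CLI (v2) YAML defines a managed online deployment. The names, paths, image, and VM size are placeholders, and the exact schema can differ across preview versions.

```yaml
# blue-deployment.yml -- illustrative managed online deployment (preview)
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
  path: ./model                 # local model files, or reference a model registered in the workspace
code_configuration:
  code: ./onlinescoring         # folder containing the scoring script
  scoring_script: score.py
environment:
  conda_file: ./environment/conda.yml
  image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest
instance_type: Standard_DS2_v2  # VM SKU (instance type)
instance_count: 1               # scale setting
```

You would then create the endpoint and the deployment, for example (flag names may vary across preview CLI versions):

```bash
az ml online-endpoint create --name my-endpoint
az ml online-deployment create --file blue-deployment.yml --all-traffic
```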

Learn how to deploy online endpoints from the CLI and the studio web portal.

Test and deploy locally for faster debugging

Deploy locally to test your endpoints without deploying to the cloud. Azure Machine Learning creates a local Docker image that mimics the Azure ML image. Azure Machine Learning will build and run deployments for you locally, and cache the image for rapid iterations.
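
As a sketch, local deployment in the CLI (v2) preview is typically a matter of adding a --local flag; the file, endpoint, and request names below are placeholders.

```bash
# Build and run the deployment in a local Docker container (illustrative).
az ml online-deployment create --file blue-deployment.yml --local

# Test the local endpoint with a sample request file.
az ml online-endpoint invoke --name my-endpoint --request-file sample-request.json --local
```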

Native blue/green deployment

Recall that a single endpoint can have multiple deployments. The online endpoint can load balance to give any percentage of traffic to each deployment.

Traffic allocation can be used for safe rollout with blue/green deployments by gradually shifting requests between deployments.
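
For example, a traffic split can be set with the CLI (v2) roughly as follows; the endpoint and deployment names are placeholders.

```bash
# Route 90% of traffic to 'blue' and 10% to 'green' (illustrative values).
az ml online-endpoint update --name my-endpoint --traffic "blue=90 green=10"
```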

Tip

A request can bypass the configured traffic load balancing by including an HTTP header of azureml-model-deployment. Set the header value to the name of the deployment you want the request to route to.
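
For example, a request routed directly to the 'green' deployment might look like the following sketch; the endpoint name, region, key, and payload are placeholders.

```bash
# Bypass the configured traffic split and send this request to the 'green' deployment.
curl --request POST "https://my-endpoint.eastus.inference.ml.azure.com/score" \
  --header "Authorization: Bearer $ENDPOINT_KEY" \
  --header "Content-Type: application/json" \
  --header "azureml-model-deployment: green" \
  --data '{"data": [[1.0, 2.0, 3.0, 4.0]]}'
```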

Screenshot showing slider interface to set traffic allocation between deployments

Learn how to safely roll out to online endpoints.

Application Insights integration

All online endpoints integrate with Application Insights to monitor SLAs and diagnose issues.

However, managed online endpoints also include out-of-box integration with Azure Logs and Azure Metrics.

Security

  • Authentication: Key and Azure ML Tokens
  • Managed identity: User assigned and system assigned
  • SSL by default for endpoint invocation

Autoscaling

Autoscale automatically provisions the right amount of resources to handle the load on your application. Managed endpoints support autoscaling through integration with the Azure Monitor autoscale feature. You can configure metrics-based scaling (for instance, CPU utilization >70%), schedule-based scaling (for example, scaling rules for peak business hours), or a combination.
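
As an illustrative sketch, an Azure Monitor autoscale profile and a CPU-based rule might be attached to a deployment roughly as follows. The resource ID, resource group, metric name (CpuUtilizationPercentage), and thresholds are assumptions that may differ in your environment.

```bash
# Create an autoscale profile for a deployment (illustrative; the deployment ARM resource ID is a placeholder).
az monitor autoscale create \
  --resource "$DEPLOYMENT_RESOURCE_ID" \
  --resource-group my-resource-group \
  --name my-autoscale-profile \
  --min-count 2 --max-count 5 --count 2

# Scale out by one instance when average CPU utilization exceeds 70% (assumed metric name).
az monitor autoscale rule create \
  --resource-group my-resource-group \
  --autoscale-name my-autoscale-profile \
  --condition "CpuUtilizationPercentage > 70 avg 5m" \
  --scale out 1
```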

Screenshot showing that autoscale flexibly provides between min and max instances, depending on rules

Visual Studio Code debugging

Visual Studio Code enables you to interactively debug endpoints.
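
A hedged sketch: in the CLI (v2) preview, a local deployment can be launched with a VS Code debug flag so the debugger attaches to the scoring container. The flag name below is an assumption and may vary by CLI version.

```bash
# Launch the deployment locally and attach the Visual Studio Code debugger (illustrative).
az ml online-deployment create --file blue-deployment.yml --local --vscode-debug
```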

Screenshot of endpoint debugging in VSCode.

Managed online endpoints vs Kubernetes online endpoints (preview)

There are two types of online endpoints: managed online endpoints (preview) and Kubernetes online endpoints (preview). Managed online endpoints help deploy your ML models in a turnkey manner. They work with powerful CPU and GPU machines in Azure in a scalable, fully managed way, and they take care of serving, scaling, securing, and monitoring your models, freeing you from the overhead of setting up and managing the underlying infrastructure. The main example in this doc uses managed online endpoints for deployment.

The following table highlights the key differences between managed online endpoints and Kubernetes online endpoints.

| | Managed online endpoints | Kubernetes online endpoints |
| --- | --- | --- |
| Recommended users | Users who want a managed model deployment and enhanced MLOps experience | Users who prefer Kubernetes and can self-manage infrastructure requirements |
| Infrastructure management | Managed compute provisioning, scaling, host OS image updates, and security hardening | User responsibility |
| Compute type | Managed (AmlCompute) | Kubernetes cluster (Kubernetes) |
| Out-of-box monitoring | Azure Monitoring (includes key metrics like latency and throughput) | Unsupported |
| Out-of-box logging | Azure Logs and Log Analytics at endpoint level | Supported |
| Application Insights | Supported | Supported |
| Managed identity | Supported | Supported |
| Virtual Network (VNET) | Not supported yet (we're working on it) | Supported |
| View costs | Endpoint and deployment level | Cluster level |

Managed online endpoints

Managed online endpoints can help streamline your deployment process. Managed online endpoints provide the following benefits over Kubernetes online endpoints:

  • Managed infrastructure

    • Automatically provisions the compute and hosts the model (you just need to specify the VM type and scale settings)
    • Automatically updates and patches the underlying host OS image
    • Automatic node recovery if there's a system failure
  • Monitoring and logs

    Screenshot showing Azure Monitor graph of endpoint latency

  • View costs

    Screenshot cost chart of an endpoint and deployment

For a step-by-step tutorial, see How to deploy online endpoints.

What are batch endpoints (preview)?

Batch endpoints (preview) are endpoints that are used to do batch inferencing on large volumes of data over a period of time. Batch endpoints receive pointers to data and run jobs asynchronously to process the data in parallel on compute clusters. Batch endpoints store outputs to a data store for further analysis.

Diagram showing that a single batch endpoint may route requests to multiple deployments, one of which is the default.

Batch deployment requirements

To create a batch deployment, you need to specify the following elements (a sketch follows this list):

  • Model files (or specify a model registered in your workspace)
  • Compute
  • Scoring script - code needed to do the scoring/inferencing
  • Environment - a Docker image with Conda dependencies
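
As an illustrative sketch, a batch deployment defined with CLI (v2) YAML might look like the following. The names, paths, image, and schema details are assumptions that can differ across preview versions.

```yaml
# batch-deployment.yml -- illustrative batch deployment (preview)
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: my-batch-deployment
endpoint_name: my-batch-endpoint
model:
  path: ./model                  # local model files, or reference a model registered in the workspace
code_configuration:
  code: ./batchscoring           # folder containing the scoring script
  scoring_script: batch_score.py
environment:
  conda_file: ./environment/conda.yml
  image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest
compute: azureml:cpu-cluster     # existing compute cluster in the workspace
resources:
  instance_count: 2
mini_batch_size: 10
output_action: append_row
```

You would then create the endpoint and deployment, for example:

```bash
az ml batch-endpoint create --name my-batch-endpoint
az ml batch-deployment create --file batch-deployment.yml
```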

If you are deploying MLflow models, there's no need to provide a scoring script and execution environment, as both are autogenerated.

Learn how to deploy and use batch endpoints with the Azure CLI and the studio web portal.

Managed cost with autoscaling compute

Invoking a batch endpoint triggers an asynchronous batch inference job. Compute resources are automatically provisioned when the job starts and automatically deallocated as the job completes, so you only pay for compute when you use it.

You can override compute resource settings (like instance count) and advanced settings (like mini batch size, error threshold, and so on) for each individual batch inference job to speed up execution and reduce cost.
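
For example, a job with per-job overrides might be started roughly like this. The input URI is a placeholder, and the override flag names are assumptions that may vary across preview CLI versions.

```bash
# Start a batch scoring job, overriding instance count and mini-batch size for this run (illustrative flags).
az ml batch-endpoint invoke --name my-batch-endpoint \
  --input https://<storage-account>.blob.core.windows.net/<container>/<path> \
  --instance-count 4 \
  --mini-batch-size 20
```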

Flexible data sources and storage

You can use the following options for input data when invoking a batch endpoint:

  • Cloud data - a registered dataset, a public data URI, or a path on a datastore
  • Data stored locally - it's uploaded to the workspace and passed to the batch endpoint

Specify the storage output location to any datastore and path. By default, batch endpoints store their output to the workspace's default blob store, organized by the job name (a system-generated GUID).

Security

  • Authentication: Azure Active Directory Tokens
  • SSL by default for endpoint invocation

Next steps