Detect data drift (preview) on datasets

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

In this article, you learn how to create Azure Machine Learning dataset monitors (preview), monitor for data drift and statistical changes in datasets, and set up alerts.

With Azure Machine Learning dataset monitors, you can:

  • Analyze drift in your data to understand how it changes over time.
  • Monitor model data for differences between training and serving datasets.
  • Monitor new data for differences between any baseline and target dataset.
  • Profile features in data to track how statistical properties change over time.
  • Set up alerts on data drift for early warnings to potential issues.

Metrics and insights are available through the Azure Application Insights resource associated with the Azure Machine Learning workspace.

Important

Monitoring data drift with the SDK is available in all editions. Monitoring data drift through the studio on the web is Enterprise edition only.

Prerequisites

To create and work with dataset monitors, you need:

What is data drift?

In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It's one of the top reasons model accuracy degrades over time, so monitoring data drift helps detect model performance issues.

Causes of data drift include:

  • Upstream process changes, such as a sensor being replaced that changes the units of measurement from inches to centimeters.
  • Data quality issues, such as a broken sensor always reading 0.
  • Natural drift in the data, such as mean temperature changing with the seasons.
  • Change in relation between features, or covariate shift.

With Azure Machine Learning dataset monitors, you can set up alerts that assist in detecting data drift in datasets over time.

Dataset monitors

You can create a dataset monitor to detect and alert to data drift on new data in a dataset, analyze historical data for drift, and profile new data over time. The data drift algorithm provides an overall measure of change in data and indicates which features are responsible and warrant further investigation. Dataset monitors produce a number of other metrics by profiling new data in the timeseries dataset. Custom alerting can be set up on all metrics generated by the monitor through Azure Application Insights. Dataset monitors can be used to quickly catch data issues and reduce the time to debug them by identifying likely causes.

Conceptually, there are three primary scenarios for setting up dataset monitors in Azure Machine Learning.

| Scenario | Description |
| --- | --- |
| Monitoring a model's serving data for drift from the model's training data | Results from this scenario can be interpreted as monitoring a proxy for the model's accuracy, given that model accuracy degrades if the serving data drifts from the training data. |
| Monitoring a time series dataset for drift from a previous time period | This scenario is more general, and can be used to monitor datasets involved upstream or downstream of model building. The target dataset must have a timestamp column, while the baseline dataset can be any tabular dataset that has features in common with the target dataset. |
| Performing analysis on past data | This scenario can be used to understand historical data and inform decisions in settings for dataset monitors. |

How datasets can monitor data

In Azure Machine Learning, data drift is monitored through datasets. To monitor for data drift, a baseline dataset - usually the training dataset for a model - is specified. A target dataset - usually model input data - is compared over time to your baseline dataset. This comparison means that your target dataset must have a timestamp column specified.

Set the timeseries trait in the target dataset

The target dataset needs to have the timeseries trait set on it by specifying the timestamp column, either from a column in the data or a virtual column derived from the path pattern of the files. This can be done through the Python SDK or Azure Machine Learning studio. A column representing a "fine grain" timestamp must be specified to add the timeseries trait to the dataset. If your data is partitioned into a folder structure with time info, such as '{yyyy/MM/dd}', you can create a virtual column through the path pattern setting and set it as the "coarse grain" timestamp to improve the performance of time series functionality.

Python SDK

The Dataset class's with_timestamp_columns() method defines the timestamp column for the dataset.

from azureml.core import Workspace, Dataset, Datastore

# get workspace object
ws = Workspace.from_config()

# get datastore object 
dstore = Datastore.get(ws, 'your datastore name')

# specify datastore paths
dstore_paths = [(dstore, 'weather/*/*/*/*/data.parquet')]

# specify partition format
partition_format = 'weather/{state}/{date:yyyy/MM/dd}/data.parquet'

# create the Tabular dataset with 'state' and 'date' as virtual columns 
dset = Dataset.Tabular.from_parquet_files(path=dstore_paths, partition_format=partition_format)

# assign the timestamp attribute to a real or virtual column in the dataset
dset = dset.with_timestamp_columns('date')

# register the dataset as the target dataset
dset = dset.register(ws, 'target')
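
If your files also contain a finer-grained timestamp column, you can set both grains at once. The following continues the example above and is a minimal sketch: the real column datetime is hypothetical, and the parameter names fine_grain_timestamp and coarse_grain_timestamp match the SDK version used in this article (newer releases rename them to timestamp and partition_timestamp).

# assumes 'datetime' is a real column in the files (hypothetical) and 'date'
# is the virtual column created from the path pattern above
dset = dset.with_timestamp_columns(fine_grain_timestamp='datetime',
                                   coarse_grain_timestamp='date')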

For a full example of using the timeseries trait of datasets, see the example notebook or the datasets SDK documentation.

Azure Machine Learning studio

Important

The functionality in this studio, https://ml.azure.com, is accessible from Enterprise workspaces only. Learn more about editions and upgrading.

If you create your dataset using Azure Machine Learning studio, ensure the path to your data contains timestamp information, include all subfolders with data, and set the partition format.

In the following example, all data under the subfolder NoaaIsdFlorida/2019 is taken, and the partition format specifies the timestamp's year, month, and day.

Partition format

In the Schema settings, specify the timestamp column from a virtual or real column in the specified dataset:

Timestamp

Dataset monitor settings

Once you create your dataset with the specified timestamp settings, you're ready to configure your dataset monitor.

The various dataset monitor settings are broken into three groups: Basic info, Monitor settings, and Backfill settings.

Basic info

This table contains basic settings used for the dataset monitor.

| Setting | Description | Tips | Mutable |
| --- | --- | --- | --- |
| Name | Name of the dataset monitor. | | No |
| Baseline dataset | Tabular dataset that will be used as the baseline for comparison of the target dataset over time. | The baseline dataset must have features in common with the target dataset. Generally, the baseline should be set to a model's training dataset or a slice of the target dataset. | No |
| Target dataset | Tabular dataset with timestamp column specified, which will be analyzed for data drift. | The target dataset must have features in common with the baseline dataset, and should be a timeseries dataset that new data is appended to. Historical data in the target dataset can be analyzed, or new data can be monitored. | No |
| Frequency | The frequency that will be used to schedule the pipeline job and analyze historical data if running a backfill. Options include daily, weekly, or monthly. | Adjust this setting to include a comparable size of data to the baseline. | No |
| Features | List of features that will be analyzed for data drift over time. | Set to a model's output feature(s) to measure concept drift. Don't include features that naturally drift over time (month, year, index, etc.). You can backfill an existing data drift monitor after adjusting the list of features. | Yes |
| Compute target | Azure Machine Learning compute target to run the dataset monitor jobs. | | Yes |

Monitor settings

These settings are for the scheduled dataset monitor pipeline that will be created.

| Setting | Description | Tips | Mutable |
| --- | --- | --- | --- |
| Enable | Enable or disable the schedule on the dataset monitor pipeline. | Disable the schedule to analyze historical data with the backfill setting. It can be enabled after the dataset monitor is created. | Yes |
| Latency | Time, in hours, it takes for data to arrive in the dataset. For instance, if it takes three days for data to arrive in the SQL DB the dataset encapsulates, set the latency to 72. | Cannot be changed after the dataset monitor is created. | No |
| Email addresses | Email addresses for alerting based on breach of the data drift percentage threshold. | Emails are sent through Azure Monitor. | Yes |
| Threshold | Data drift percentage threshold for email alerting. | Further alerts and events can be set on many other metrics in the workspace's associated Application Insights resource. | Yes |
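
The mutable settings in this group can also be changed on an existing monitor through the Python SDK. The following is a minimal sketch, assuming update() accepts the drift_threshold and alert_config keyword arguments described in the SDK reference, and that AlertConfiguration takes a list of email addresses (the address below is hypothetical):

from azureml.core import Workspace
from azureml.datadrift import AlertConfiguration, DataDriftDetector

ws = Workspace.from_config()
monitor = DataDriftDetector.get_by_name(ws, 'drift-monitor')

# update the mutable alerting settings; name, datasets, frequency, and latency
# can't be changed after creation
monitor = monitor.update(drift_threshold=0.5,
                         alert_config=AlertConfiguration(email_addresses=['user@contoso.com']))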

Backfill settings

These settings are for running a backfill on past data for data drift metrics.

| Setting | Description | Tips |
| --- | --- | --- |
| Start date | Start date of the backfill job. | |
| End date | End date of the backfill job. The end date cannot be more than 31*frequency units of time from the start date. | On an existing dataset monitor, metrics can be backfilled to analyze historical data or replace metrics with updated settings. |

Create dataset monitors

Create dataset monitors to detect and alert to data drift on a new dataset with Azure Machine Learning studio or the Python SDK.

Azure Machine Learning studio

Important

The functionality in this studio, https://ml.azure.com, is accessible from Enterprise workspaces only. Learn more about editions and upgrading.

To set up alerts on your dataset monitor, the workspace that contains the dataset you want to create a monitor for must have Enterprise edition capabilities.

After the workspace functionality is confirmed, navigate to the studio's homepage and select the Datasets tab on the left. Select Dataset monitors.

Monitor list

Select the +Create monitor button and continue through the wizard by selecting Next.

Wizard

The resulting dataset monitor will appear in the list. Select it to go to that monitor's details page.

Python SDK

See the Python SDK reference documentation on data drift for full details.

The following example shows how to create a dataset monitor using the Python SDK:

from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector
from datetime import datetime

# get the workspace object
ws = Workspace.from_config()

# get the target dataset
target = Dataset.get_by_name(ws, 'target')

# set the baseline dataset as the slice of the target dataset from before February 2019
baseline = target.time_before(datetime(2019, 2, 1))

# set up feature list
features = ['latitude', 'longitude', 'elevation', 'windAngle', 'windSpeed', 'temperature', 'snowDepth', 'stationName', 'countryOrRegion']

# set up data drift detector
monitor = DataDriftDetector.create_from_datasets(ws, 'drift-monitor', baseline, target, 
                                                      compute_target='cpu-cluster', 
                                                      frequency='Week', 
                                                      feature_list=None, 
                                                      drift_threshold=.6, 
                                                      latency=24)

# get data drift detector by name
monitor = DataDriftDetector.get_by_name(ws, 'drift-monitor')

# update data drift detector
monitor = monitor.update(feature_list=features)

# run a backfill for January through May
backfill1 = monitor.backfill(datetime(2019, 1, 1), datetime(2019, 5, 1))

# run a backfill for May through today
backfill2 = monitor.backfill(datetime(2019, 5, 1), datetime.today())

# disable the pipeline schedule for the data drift detector
monitor = monitor.disable_schedule()

# enable the pipeline schedule for the data drift detector
monitor = monitor.enable_schedule()
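
After a backfill run completes, you can inspect its results from the SDK. The following is a minimal sketch: it assumes backfill() returns a run object supporting wait_for_completion(), and that the DataDriftDetector get_output() and show() methods behave as described in the SDK reference (the exact return structure varies by SDK version).

# wait for the most recent backfill run to finish
backfill2.wait_for_completion(show_output=True)

# retrieve the computed drift metrics for the backfilled window
results = monitor.get_output(start_time=datetime(2019, 5, 1), end_time=datetime.today())

# plot drift trends for the same window in a notebook
monitor.show(datetime(2019, 5, 1), datetime.today())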

For a full example of setting up a timeseries dataset and data drift detector, see our example notebook.

Understanding data drift results

The dataset monitor produces two groups of results: Drift overview and Feature details. The following animation shows the available drift monitor charts based on the selected feature and metric.

Demo video

Drift overview

The Drift overview section contains top-level insights into the magnitude of data drift and which features should be further investigated.

| Metric | Description | Tips |
| --- | --- | --- |
| Data drift magnitude | A percentage measure of drift between the baseline and target dataset over time, ranging from 0 to 100, where 0 indicates identical datasets and 100 indicates the Azure Machine Learning data drift capability can completely tell the two datasets apart. | Noise in the precise percentage measured is expected, since machine learning techniques are used to generate this magnitude. |
| Drift contribution by feature | The contribution of each feature in the target dataset to the measured drift magnitude. | Due to covariate shift, the underlying distribution of a feature does not necessarily need to change to have relatively high feature importance. |

The following image is an example of charts seen in the Drift overview results in Azure Machine Learning studio, resulting from a backfill of NOAA Integrated Surface Data. The data was filtered to rows where stationName contains 'FLORIDA', with January 2019 used as the baseline dataset and all 2019 data used as the target.

Drift overview

Feature details

The Feature details section contains feature-level insights into the change in the selected feature's distribution, as well as other statistics, over time.

The target dataset is also profiled over time. For each feature, the statistical distance between its baseline distribution and its distribution in the target dataset is computed over time. This is conceptually similar to the data drift magnitude, except that this statistical distance applies to an individual feature. Min, max, and mean are also available.

In the Azure Machine Learning studio, if you click on a data point in the graph, the distribution of the feature being shown adjusts accordingly. By default, it shows the baseline dataset's distribution and the most recent run's distribution of the same feature.

These metrics can also be retrieved in the Python SDK through the get_metrics() method on a DataDriftDetector object.
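
For example, a minimal sketch (the structure of the returned metrics varies by SDK version, so print them to explore):

# retrieve all metrics computed by the monitor's runs
metrics = monitor.get_metrics()
print(metrics)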

Numeric features

Numeric features are profiled in each dataset monitor run. The following are exposed in the Azure Machine Learning studio. Probability density is shown for the distribution.

| Metric | Description |
| --- | --- |
| Wasserstein distance | Minimum amount of work to transform the baseline distribution into the target distribution. |
| Mean value | Average value of the feature. |
| Min value | Minimum value of the feature. |
| Max value | Maximum value of the feature. |

Feature details numeric

Categorical features

Categorical features are profiled in each dataset monitor run. The following are exposed in the Azure Machine Learning studio. A histogram is shown for the distribution.

| Metric | Description |
| --- | --- |
| Euclidean distance | Geometric distance between the baseline and target distributions. |
| Unique values | Number of unique values (cardinality) of the feature. |

Feature details categorical
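
Both distances are standard statistical measures. The following illustrative sketch (not the service's internal implementation) shows how each behaves on synthetic data, using scipy for the Wasserstein distance of a numeric feature and numpy for the Euclidean distance between category frequency vectors:

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# numeric feature: Wasserstein distance between baseline and target samples
baseline_temps = rng.normal(20, 5, 1000)  # e.g., January temperatures
target_temps = rng.normal(24, 5, 1000)    # e.g., April temperatures
print(wasserstein_distance(baseline_temps, target_temps))  # roughly 4, the mean shift

# categorical feature: Euclidean distance between normalized frequency vectors
baseline_freq = np.array([0.5, 0.3, 0.2])  # category proportions in the baseline
target_freq = np.array([0.4, 0.2, 0.4])    # category proportions in the target
print(np.linalg.norm(baseline_freq - target_freq))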

Metrics, alerts, and events

Metrics can be queried in the Azure Application Insights resource associated with your machine learning workspace. This gives you access to all the features of Application Insights, including setting up custom alert rules and action groups to trigger actions such as email/SMS/push/voice notifications or an Azure Function. Refer to the complete Application Insights documentation for details.

To get started, navigate to the Azure portal and select your workspace's Overview page. The associated Application Insights resource is on the far right:

Azure portal overview

Select Logs (Analytics) under Monitoring on the left pane:

Application insights overview

The dataset monitor metrics are stored as customMetrics. You can write and run a query after setting up a dataset monitor to view them:

Log analytics query

After identifying the metrics you want to alert on, create a new alert rule:

New alert rule

You can use an existing action group, or create a new one to define the action to be taken when the set conditions are met:

New action group

Troubleshooting

Limitations and known issues:

  • The time range of backfill jobs is limited to 31 intervals of the monitor's frequency setting.
  • There is a limit of 200 features, unless a feature list is not specified (all features used).
  • Compute size must be large enough to handle the data.
  • Ensure your dataset has data within the start and end date for a given monitor run.
  • Dataset monitors only work on datasets that contain 50 rows or more.

Columns, or features, in the dataset are classified as categorical or numeric based on the conditions in the following table. If a feature doesn't meet these conditions (for instance, a column of type string with more than 100 unique values), it's dropped from the data drift algorithm, but it's still profiled.

| Feature type | Data type | Condition | Limitations |
| --- | --- | --- | --- |
| Categorical | string, bool, int, float | The number of unique values in the feature is less than 100 and less than 5% of the number of rows. | Null is treated as its own category. |
| Numerical | int, float | The values in the feature are of a numerical data type and do not meet the condition for a categorical feature. | Feature dropped if >15% of values are null. |
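
The classification logic can be approximated as follows. This is an illustrative sketch of the rules in the table, not the service's exact implementation:

import pandas as pd

def classify_feature(series: pd.Series) -> str:
    """Approximate the classification rules above (illustrative only)."""
    # null is treated as its own category when counting unique values
    n_unique = series.nunique(dropna=False)
    if n_unique < 100 and n_unique < 0.05 * len(series):
        return 'categorical'
    if pd.api.types.is_numeric_dtype(series):
        # numeric features are dropped if more than 15% of values are null
        return 'dropped' if series.isna().mean() > 0.15 else 'numerical'
    # e.g., a string column with more than 100 unique values: still profiled,
    # but excluded from the drift algorithm
    return 'dropped'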

Next steps