Secure data access in Azure Machine Learning

Azure Machine Learning makes it easy to connect to your data in the cloud. It provides an abstraction layer over the underlying storage service, so you can securely access and work with your data without having to write code specific to your storage type. Azure Machine Learning also provides the following data capabilities:

  • Interoperability with Pandas and Spark DataFrames
  • Versioning and tracking of data lineage
  • Data labeling
  • Data drift monitoring

Data workflow

When you're ready to use the data in your cloud-based storage solution, we recommend the following data delivery workflow. This workflow assumes you have an Azure storage account and data in a cloud-based storage service in Azure.

  1. Create an Azure Machine Learning datastore to store connection information to your Azure storage.

  2. From that datastore, create an Azure Machine Learning dataset that points to one or more specific files in your underlying storage.

  3. To use that dataset in your machine learning experiment, you can either:

    1. Mount it to your experiment's compute target for model training.

      OR

    2. Consume it directly in Azure Machine Learning solutions such as automated machine learning (automated ML) experiment runs, machine learning pipelines, or the Azure Machine Learning designer.

  4. Create dataset monitors for your model output dataset to detect data drift.

  5. If data drift is detected, update your input dataset and retrain your model accordingly.

The following diagram provides a visual demonstration of this recommended workflow.

Diagram shows the Azure Storage Service which flows into a datastore, which flows into a dataset. The dataset flows into model training, which flows into data drift, which flows back to dataset.
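
Step 3, option 1 of the workflow (mounting a dataset on a compute target for training) might look like the following with the Azure Machine Learning Python SDK v1 (azureml-core). This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes a workspace config.json is available locally and that a FileDataset has already been registered. The dataset name training-files, the script train.py, the --data-path argument, the compute target cpu-cluster, and the experiment name are all hypothetical placeholders.

```python
def submit_training_run(dataset_name="training-files",
                        compute_name="cpu-cluster",
                        experiment_name="data-access-demo"):
    """Mount a registered FileDataset on a compute target and submit a run."""
    # Imported inside the function so this file loads without the SDK installed.
    from azureml.core import Dataset, Experiment, ScriptRunConfig, Workspace

    ws = Workspace.from_config()  # assumes a local config.json for the workspace
    dataset = Dataset.get_by_name(ws, name=dataset_name)

    # as_mount() makes the files available on the compute target; the mount
    # path is handed to the (hypothetical) train.py as a command-line argument.
    run_config = ScriptRunConfig(
        source_directory=".",
        script="train.py",
        arguments=["--data-path", dataset.as_mount()],
        compute_target=compute_name,
    )
    return Experiment(ws, experiment_name).submit(run_config)
```

Calling submit_training_run() requires a real workspace, compute target, and registered dataset, so treat the defaults as stand-ins for your own resource names.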

Connect to storage with datastores

Azure Machine Learning datastores securely keep the connection information to your data storage on Azure, so you don't have to code it in your scripts. Register and create a datastore to easily connect to your storage account, and access the data in your underlying storage service.

Supported cloud-based storage services in Azure that can be registered as datastores:

  • Azure Blob Container
  • Azure File Share
  • Azure Data Lake
  • Azure Data Lake Gen2
  • Azure SQL Database
  • Azure Database for PostgreSQL
  • Databricks File System
  • Azure Database for MySQL

Tip

The generally available functionality for creating datastores requires credential-based authentication for accessing storage services, like a service principal or shared access signature (SAS) token. These credentials can be accessed by users who have Reader access to the workspace.

If this is a concern, create a datastore that uses identity-based data access to storage services (preview). This capability is an experimental preview feature, and may change at any time.
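
As a rough illustration of credential-based registration, the sketch below registers a blob container as a datastore with the SDK v1 call Datastore.register_azure_blob_container. It is a sketch under assumptions, not the only way to do this: the environment variable names (AML_BLOB_ACCOUNT, AML_BLOB_CONTAINER, AML_BLOB_SAS_TOKEN), the helper blob_settings_from_env, and the datastore name my_blob_datastore are hypothetical, and a workspace config.json is assumed to be available locally.

```python
import os


def blob_settings_from_env(env=None):
    """Collect datastore settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    required = ("AML_BLOB_ACCOUNT", "AML_BLOB_CONTAINER", "AML_BLOB_SAS_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return {
        "account_name": env["AML_BLOB_ACCOUNT"],
        "container_name": env["AML_BLOB_CONTAINER"],
        "sas_token": env["AML_BLOB_SAS_TOKEN"],
    }


def register_blob_datastore(datastore_name="my_blob_datastore"):
    """Register a blob container as a datastore and return it."""
    # Imported here so the helper above works without the SDK installed.
    from azureml.core import Datastore, Workspace

    ws = Workspace.from_config()  # assumes a local config.json
    return Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name=datastore_name,
        **blob_settings_from_env(),
    )
```

After registration, the connection information lives in the workspace, so later scripts can retrieve the datastore by name with Datastore.get(ws, "my_blob_datastore") and never handle the SAS token themselves.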

Reference data in storage with datasets

Azure Machine Learning datasets aren't copies of your data. By creating a dataset, you create a reference to the data in its storage service, along with a copy of its metadata.

Because datasets are lazily evaluated, and the data remains in its existing location, you

  • Incur no extra storage cost.
  • Don't risk unintentionally changing your original data sources.
  • Improve ML workflow performance.
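
To make the idea of a lazily evaluated reference concrete, here is a small conceptual sketch in plain Python. It does not use the Azure Machine Learning SDK; the class LazyCsvReference and the temporary CSV file are hypothetical stand-ins for a dataset and for data in storage. The point is only that creating the reference reads nothing, and the data is read when it is actually requested.

```python
import csv
import os
import tempfile


class LazyCsvReference:
    """Holds only a path; the file is read when rows() is called."""

    def __init__(self, path):
        self.path = path      # a reference to the data, not a copy of it
        self.read_count = 0   # how many times the file has been opened

    def rows(self):
        self.read_count += 1
        with open(self.path, newline="") as handle:
            return list(csv.DictReader(handle))


# Write a small CSV file to stand in for data that lives in storage.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("id,label\n1,cat\n2,dog\n")
    source_path = tmp.name

reference = LazyCsvReference(source_path)  # nothing has been read yet
reads_before_load = reference.read_count
loaded_rows = reference.rows()             # the file is read only now
os.remove(source_path)                     # clean up the stand-in file
```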

To interact with your data in storage, create a dataset to package your data into a consumable object for machine learning tasks. Register the dataset to your workspace to share and reuse it across different experiments without data ingestion complexities.

Datasets can be created from local files, public URLs, Azure Open Datasets, or Azure storage services via datastores.

There are two types of datasets:

  • A FileDataset references single or multiple files in your datastores or public URLs. If your data is already cleansed and ready to use in training experiments, you can download or mount files referenced by FileDatasets to your compute target.

  • A TabularDataset represents data in a tabular format by parsing the provided file or list of files. You can load a TabularDataset into a pandas or Spark DataFrame for further manipulation and cleansing. For a complete list of data formats you can create TabularDatasets from, see the TabularDatasetFactory class.
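
The two dataset types could be created as in the following sketch, which uses the SDK v1 factory methods Dataset.File.from_files and Dataset.Tabular.from_delimited_files. The datastore name my_blob_datastore, the paths images/** and tables/train.csv, and the registered name train-table are hypothetical, and a workspace config.json is assumed to be available locally.

```python
def create_datasets(datastore_name="my_blob_datastore"):
    """Create one FileDataset and one TabularDataset from the same datastore."""
    # Imported inside the function so this file loads without the SDK installed.
    from azureml.core import Dataset, Datastore, Workspace

    ws = Workspace.from_config()  # assumes a local config.json
    datastore = Datastore.get(ws, datastore_name)

    # FileDataset: references every file under a folder (hypothetical path).
    images = Dataset.File.from_files(path=(datastore, "images/**"))

    # TabularDataset: parses a delimited file into rows and columns.
    table = Dataset.Tabular.from_delimited_files(
        path=(datastore, "tables/train.csv")
    )

    # Registering makes the dataset reusable by name across experiments.
    table = table.register(workspace=ws, name="train-table",
                           create_new_version=True)

    # The data is read from storage only when it is materialized, as here.
    frame = table.to_pandas_dataframe()
    return images, table, frame
```

Because both datasets only reference the files in the datastore, creating them does not copy any data; only the to_pandas_dataframe() call loads it.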


Work with your data

With datasets, you can accomplish a number of machine learning tasks through seamless integration with Azure Machine Learning features.

Label data with data labeling projects

Labeling large amounts of data has often been a headache in machine learning projects. Those with a computer vision component, such as image classification or object detection, generally require thousands of images and corresponding labels.

Azure Machine Learning gives you a central location to create, manage, and monitor labeling projects. Labeling projects help coordinate the data, labels, and team members, allowing you to manage the labeling tasks more efficiently. Currently supported tasks are image classification, either multi-label or multi-class, and object identification using bounding boxes.

Create an image labeling project or text labeling project, and output a dataset for use in machine learning experiments.

Monitor model performance with data drift

In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It is one of the top reasons model accuracy degrades over time, so monitoring data drift helps detect model performance issues.
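
As a toy illustration of what "change in model input data" can mean, the sketch below compares one numeric feature between a baseline sample and a recent sample. This is not the algorithm used by Azure Machine Learning dataset monitors; the function names, the sample values, and the threshold of 0.5 are assumptions chosen only for the example. It measures how far the recent mean has moved, in units of the baseline standard deviation.

```python
from statistics import mean, stdev


def mean_shift_score(baseline, target):
    """Shift of the target mean, in units of baseline standard deviation."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation")
    return abs(mean(target) - mean(baseline)) / spread


def has_drifted(baseline, target, threshold=0.5):
    """Flag drift when the standardized mean shift exceeds the threshold."""
    return mean_shift_score(baseline, target) > threshold


# Hypothetical feature values: seen at training time vs. seen recently.
baseline_temps = [20.0, 21.0, 19.0, 20.0, 22.0, 18.0]
recent_temps = [25.0, 26.0, 24.0, 27.0, 25.0, 26.0]

drift_detected = has_drifted(baseline_temps, recent_temps)
```

A real monitor would look at many features and at the shape of their distributions, not just a single mean, but the principle of comparing new data against a baseline is the same.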

See the Create a dataset monitor article to learn more about how to detect and alert on data drift in new data in a dataset.

Next steps