How to add and manage data in your Azure AI project

Note

Azure AI Studio is currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

This article shows how to create and manage data in Azure AI Studio. Data can be used as a source for indexing in Azure AI Studio.

Data also helps when you need these capabilities:

  • Versioning: Data versioning is supported.
  • Reproducibility: Once you create a data version, it is immutable. It cannot be modified or deleted. Therefore, jobs or prompt flow pipelines that consume the data can be reproduced.
  • Auditability: Because the data version is immutable, you can track the asset versions, who updated a version, and when the version updates occurred.
  • Lineage: For any given data, you can view which jobs or prompt flow pipelines consume the data.
  • Ease-of-use: An Azure AI Studio data resembles web browser bookmarks (favorites). Instead of remembering long storage paths that reference your frequently-used data on Azure Storage, you can create a data version and then access that version of the asset with a friendly name.
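
The versioning, immutability, and friendly-name ideas above can be illustrated with a small conceptual sketch. It is not an Azure AI Studio or Azure SDK API: the `DataVersion` and `DataCatalog` names, the `sales` asset, and the `example.invalid` URLs are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class DataVersion:
    """One immutable version of a named data asset (hypothetical model)."""
    name: str
    version: int
    path: str


class DataCatalog:
    """Toy in-memory catalog; not part of any Azure SDK."""

    def __init__(self):
        self._versions = {}  # (name, version) -> DataVersion

    def create(self, name, path):
        # Each call registers the next version; earlier versions stay untouched.
        latest = max((v for (n, v) in self._versions if n == name), default=0)
        record = DataVersion(name, latest + 1, path)
        self._versions[(name, record.version)] = record
        return record

    def get(self, name, version):
        # Look up by friendly name and version instead of a long storage path.
        return self._versions[(name, version)]


catalog = DataCatalog()
v1 = catalog.create("sales", "https://example.invalid/container/sales/2023.csv")
v2 = catalog.create("sales", "https://example.invalid/container/sales/2024.csv")

try:
    v1.path = "https://example.invalid/other.csv"  # attempt to modify a version
    mutated = True
except FrozenInstanceError:
    mutated = False  # frozen dataclasses reject attribute assignment
```

Because an existing version never changes, a job that recorded `("sales", 1)` always resolves to the same path, which is the property that makes reruns reproducible.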

Prerequisites

To create and work with data, you need:

  • An Azure subscription. If you don't have one, create a free account before you begin.

  • An Azure AI project in Azure AI Studio.

Create data

When you create your data, you need to set the data type. AI Studio supports these data types:

Type   | Canonical               | Scenarios
file   | Reference a single file | Read a single file on Azure Storage (the file can have any format).
folder | Reference a folder      | Read a folder of parquet/CSV files into Pandas/Spark. Read unstructured data (images, text, audio, etc.) located in a folder.
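
The difference between the two types can be sketched with local files. This example is an illustration only: it uses Python's standard `csv` module and a temporary local folder instead of Azure Storage, Pandas, or Spark, and the `read_file` and `read_folder` helpers are hypothetical names.

```python
import csv
import tempfile
from pathlib import Path


def read_file(path):
    """File-type access: read the rows of exactly one CSV file."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def read_folder(folder):
    """Folder-type access: read the rows of every CSV file under a folder."""
    rows = []
    for csv_path in sorted(Path(folder).rglob("*.csv")):
        rows.extend(read_file(csv_path))
    return rows


# Build a small local folder so the sketch runs without a storage account.
with tempfile.TemporaryDirectory() as tmp:
    for file_name, value in [("a.csv", "1"), ("b.csv", "2")]:
        (Path(tmp) / file_name).write_text("id\n" + value + "\n")
    single = read_file(Path(tmp) / "a.csv")
    combined = read_folder(tmp)
```

A file reference yields the contents of one file, while a folder reference lets the consumer decide how to enumerate and combine whatever the folder contains.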

The supported source paths are shown in Azure AI Studio. You can create a data from a folder or file:

  • If you select the folder type, you can choose the folder URL format. The supported folder URL formats are shown in Azure AI Studio.

  • If you select the file type, you can choose the file URL format. The supported file URL formats are shown in Azure AI Studio.

Create data: File type

A data that is a File (uri_file) type points to a single file on storage (for example, a CSV file).

These steps explain how to create a File typed data in Azure AI Studio:

  1. Navigate to Azure AI Studio.

  2. From the collapsible menu on the left, select Data under Components. Select Add Data.

  3. Choose your Data source. You have three options: (a) select data from Existing Connections, (b) Get data with Storage URL if you have a direct URL to a storage account or a publicly accessible HTTPS server, or (c) Upload files/folders to upload from your local drive.

    1. Existing Connections: Select an existing connection, browse into it, and choose the file you need. If the existing connections don't work for you, select the Add connection button on the right.

    2. Get data with Storage URL: Set the Type to "File", and provide a URL that follows one of the supported URL formats listed on the page.

    3. Upload files/folders: Select Upload files or folder, select Upload files, and choose the local file to upload. The file is uploaded into the default "workspaceblobstore" connection.

  4. Select Next after choosing the data source.

  5. Enter a custom name for your data, and then select Create.

Create data: Folder type

A data that is a Folder (uri_folder) type points to a folder on storage (for example, a folder containing several subfolders of images).

Use these steps to create a Folder typed data in Azure AI Studio:

  1. Navigate to Azure AI Studio.

  2. From the collapsible menu on the left, select Data under Components. Select Add Data.

  3. Choose your Data source. You have three options: (a) select data from Existing Connections, (b) Get data with Storage URL if you have a direct URL to a storage account or a publicly accessible HTTPS server, or (c) Upload files/folders to upload a folder from your local drive.

    1. Existing Connections: Select an existing connection, browse into it, and choose the folder you need. If the existing connections don't work for you, select the Add connection button on the right.

    2. Get data with Storage URL: Set the Type to "Folder", and provide a URL that follows one of the supported URL formats listed on the page.

    3. Upload files/folders: Select Upload files or folder, and choose the local folder to upload. The folder is uploaded into the default "workspaceblobstore" connection.

  4. Select Next after choosing the data source.

  5. Enter a custom name for your data, and then select Create.

Manage data

Delete data

Important

By design, data deletion is not supported.

If Azure AI allowed data deletion, it would have the following adverse effects:

  • Production jobs that consume data that were later deleted would fail.
  • It would become more difficult to reproduce an ML experiment.
  • Job lineage would break, because it would become impossible to view the deleted data version.
  • You would not be able to track and audit correctly, since versions could be missing.

Therefore, the immutability of data provides a level of protection when working in a team creating production workloads.

When a data has been erroneously created - for example, with an incorrect name, type or path - Azure AI offers solutions to handle the situation without the negative consequences of deletion:

I want to delete this data because... | Solution
The name is incorrect                 | Archive the data
The team no longer uses the data      | Archive the data
It clutters the data listing          | Archive the data
The path is incorrect                 | Create a new version of the data (same name) with the correct path. For more information, read Create data.
It has an incorrect type              | Currently, Azure AI doesn't allow the creation of a new version with a different type compared to the initial version. (1) Archive the data. (2) Create a new data under a different name with the correct type.

Archive data

Archiving a data hides it by default from both list queries (for example, the CLI command az ml data list) and the data listing in Azure AI Studio. You can still reference and use an archived data in your workflows. You can archive either:

  • all versions of the data under a given name, or
  • a specific data version

Archive all versions of a data

This operation archives all versions of the data under a given name.

Important

Currently, archiving is not supported in Azure AI Studio.

Archive a specific data version

This operation archives a single data version and leaves the other versions listed.

Important

Currently, archiving is not supported in Azure AI Studio.

Restore an archived data

You can restore an archived data. If all versions of the data are archived, you can't restore individual versions of the data; you must restore all versions.

Restore all versions of a data

This operation restores all versions of the data under a given name.

Important

Currently, restoring archived data is not supported in Azure AI Studio.

Restore a specific data version

Important

If all data versions were archived, you cannot restore individual versions of the data - you must restore all versions.

This operation restores a single archived data version.

Important

Currently, restoring a specific data version is not supported in Azure AI Studio.
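
The archive and restore rules in this section can be summarized with a conceptual model. This sketch is not an Azure AI Studio feature or SDK call: `DataAsset`, `ArchiveError`, and the `sales` asset are hypothetical names, and the way name-level and version-level archiving interact (restoring all versions clears only the name-level flag) is an assumption made for illustration.

```python
class ArchiveError(Exception):
    """Raised when a restore request breaks the rule described above."""


class DataAsset:
    """Toy model of one named data asset and its versions (not an Azure API)."""

    def __init__(self, name, versions):
        self.name = name
        self.versions = set(versions)
        self.all_archived = False        # set by "archive all versions"
        self.archived_versions = set()   # versions archived one at a time

    def archive(self, version=None):
        if version is None:
            self.all_archived = True
        else:
            self.archived_versions.add(version)

    def restore(self, version=None):
        if version is None:
            self.all_archived = False
        elif self.all_archived:
            # Versions archived together must be restored together.
            raise ArchiveError("restore all versions instead")
        else:
            self.archived_versions.discard(version)

    def listed_versions(self, include_archived=False):
        """Versions returned by a list query; archived ones are hidden by default."""
        if include_archived:
            return sorted(self.versions)
        if self.all_archived:
            return []
        return sorted(self.versions - self.archived_versions)

    def resolve(self, version):
        """Archived versions can still be referenced and used."""
        if version not in self.versions:
            raise KeyError(version)
        return f"{self.name}:{version}"


asset = DataAsset("sales", [1, 2, 3])
asset.archive(version=2)
visible_after_one = asset.listed_versions()   # version 2 is hidden
asset.archive()                               # archive every version
visible_after_all = asset.listed_versions()   # nothing is listed
still_usable = asset.resolve(2)               # archived data stays usable

try:
    asset.restore(version=2)                  # not allowed while all are archived
    blocked = False
except ArchiveError:
    blocked = True

asset.restore()                               # restore all versions
```

The model separates "hidden from listings" from "unusable": archiving changes only what list queries return, and never what a workflow can reference.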

Data tagging

Data supports tagging, which is extra metadata applied to the data in the form of key-value pairs. Data tagging provides many benefits:

  • Describes data quality. For example, if your organization uses a medallion lakehouse architecture, you can tag assets with medallion:bronze (raw), medallion:silver (validated), and medallion:gold (enriched).
  • Provides efficient searching and filtering of data, to help data discovery.
  • Helps identify sensitive personal data, to properly manage and govern data access. For example, sensitivity:PII/sensitivity:nonPII.
  • Identifies whether data is approved from a responsible AI (RAI) audit. For example, RAI_audit:approved/RAI_audit:todo.

You can add tags to existing data.
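
As a rough illustration of how key-value tags support searching and filtering, the following sketch filters an in-memory collection by tag. It does not call any Azure API: the asset names and the `filter_by_tags` helper are hypothetical, and only the tag keys and values come from the examples above.

```python
def filter_by_tags(assets, **required):
    """Return names of assets whose tags contain every required key-value pair."""
    return [
        name
        for name, tags in assets.items()
        if all(tags.get(key) == value for key, value in required.items())
    ]


# Hypothetical assets tagged with the example keys from this section.
assets = {
    "raw_orders": {"medallion": "bronze", "sensitivity": "PII", "RAI_audit": "todo"},
    "clean_orders": {"medallion": "silver", "sensitivity": "nonPII", "RAI_audit": "approved"},
    "order_features": {"medallion": "gold", "sensitivity": "nonPII", "RAI_audit": "approved"},
}

approved_non_pii = filter_by_tags(assets, sensitivity="nonPII", RAI_audit="approved")
gold_only = filter_by_tags(assets, medallion="gold")
```

Keeping tag keys and values consistent across a team (for example, always `sensitivity:PII` rather than mixed spellings) is what makes this kind of exact-match filtering reliable.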

Data preview

You can browse the folder structure and preview files on the Data details page. Data preview is supported for the following types:

  • Data file types, which are supported through the preview API: ".tsv", ".csv", ".parquet", ".jsonl".
  • Other file types, which the Studio UI attempts to preview natively in the browser, so the supported file types might depend on the browser itself. Images with these extensions are normally supported: ".png", ".jpg", ".gif". Files with these extensions are also normally supported: ".ipynb", ".py", ".yml", ".html".
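
The preview rules above amount to a lookup on the file extension. The sketch below encodes them as a small function; `preview_mode` and its return labels are hypothetical, and treating extensions as case-insensitive is an assumption that the text does not confirm.

```python
from pathlib import PurePosixPath

# Extensions taken from the lists above.
PREVIEW_API_TYPES = {".tsv", ".csv", ".parquet", ".jsonl"}
BROWSER_IMAGE_TYPES = {".png", ".jpg", ".gif"}
BROWSER_TEXT_TYPES = {".ipynb", ".py", ".yml", ".html"}


def preview_mode(file_name):
    """Classify how a file would be previewed, based only on its extension."""
    suffix = PurePosixPath(file_name).suffix.lower()  # assumption: case-insensitive
    if suffix in PREVIEW_API_TYPES:
        return "preview_api"
    if suffix in BROWSER_IMAGE_TYPES | BROWSER_TEXT_TYPES:
        return "browser_likely"
    # Any other type depends on what the browser can render natively.
    return "browser_unknown"


modes = {name: preview_mode(name) for name in ["data/train.csv", "logo.PNG", "archive.zip"]}
```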

Next steps