Dataset class

Definition

The Dataset class is a resource for exploring, transforming and managing data in Azure Machine Learning.

You can explore your data with summary statistics and transform it using intelligent transforms. When you are ready to use the data for training, you can save the Dataset to your AzureML workspace to get versioning and reproducibility capabilities.

To learn more about Azure ML Datasets, go to: https://aka.ms/azureml/concepts/datasets.

Dataset(definition, workspace=None, name=None, id=None)
Inheritance
builtins.object
Dataset
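A minimal sketch of the typical workflow, assuming a workspace config file is available locally; the Dataset name and file path are placeholders:

    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()

    # Create an unregistered, in-memory Dataset from a local delimited file.
    dataset = Dataset.from_delimited_files(path='./data.csv')

    # Register it in the workspace for versioning and reproducibility.
    dataset = dataset.register(workspace=ws, name='my-dataset',
                               description='Example registration')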

Methods

archive()

Archive the Dataset.

auto_read_files(path, include_path=False)

Analyzes the file(s) at the specified path and returns a new Dataset.

compare_profiles(rhs_dataset, profile_arguments={}, include_columns=None, exclude_columns=None, histogram_compare_method=<HistogramCompareMethod.WASSERSTEIN: 0>)

Compare the current Dataset's profile with rhs_dataset profile.

create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)

Create a snapshot of the registered Dataset.

delete_snapshot(snapshot_name)

Delete snapshot of the Dataset by name.

deprecate(deprecate_by_dataset_id)

Deprecate the Dataset, with a pointer to the new Dataset.

diff(rhs_dataset, compute_target=None, columns=None)

Diff the current Dataset with rhs_dataset.

from_binary_files(path)

Create unregistered, in-memory Dataset from binary files.

from_delimited_files(path, separator=',', header=<PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS: 3>, encoding=<FileEncoding.UTF8: 0>, quoting=False, infer_column_types=True, skip_rows=0, skip_mode=<SkipLinesBehavior.NO_ROWS: 0>, comment=None, include_path=False, archive_options=None)

Create unregistered, in-memory Dataset from delimited files.

from_excel_files(path, sheet_name=None, use_column_headers=False, skip_rows=0, include_path=False, infer_column_types=True)

Create unregistered, in-memory Dataset from Excel files.

from_json_files(path, encoding=<FileEncoding.UTF8: 0>, flatten_nested_arrays=False, include_path=False)

Create unregistered, in-memory Dataset from json files.

from_pandas_dataframe(dataframe, path=None, in_memory=False)

Create unregistered, in-memory Dataset from pandas dataframe.

from_parquet_files(path, include_path=False)

Create unregistered, in-memory Dataset from parquet files.

from_sql_query(data_source, query)

Create unregistered, in-memory Dataset from sql query.

generate_profile(compute_target=None, workspace=None, arguments=None)

Generate new profile for the Dataset.

get(workspace, name=None, id=None)

Get a Dataset that already exists in the workspace by specifying either its name or id.

get_all_snapshots()

Get all snapshots of the Dataset.

get_definition(version_id=None)

Get a specific definition of the Dataset.

get_definitions()

Get all the definitions of the Dataset.

get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None)

Get summary statistics on the Dataset computed earlier.

get_snapshot(snapshot_name)

Get snapshot of the Dataset by name.

head(count)

Pull the specified number of records from this Dataset and return them as a DataFrame.

list(workspace)

List all of the Datasets in the workspace, including ones with is_visible=False.

reactivate()

Reactivate the Dataset. Works on Datasets that have been deprecated or archived.

register(workspace, name, description=None, tags=None, visible=True, exist_ok=False, update_if_exist=False)

Register the Dataset in the workspace, making it available to other users of the workspace.

sample(sample_strategy, arguments)

Generate a new sample from the source Dataset, using the sampling strategy and parameters provided.

to_pandas_dataframe()

Create a Pandas dataframe by executing the transformation pipeline defined by this Dataset definition.

to_spark_dataframe()

Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataset definition.

update(name=None, description=None, tags=None, visible=None)

Update the Dataset mutable attributes in the workspace and return the updated Dataset from the workspace.

update_definition(definition, definition_update_message)

Update the Dataset definition.

archive()

Archive the Dataset.

archive()

Returns

None.

Return type

None

Remarks

After archival, any attempt to consume the Dataset results in an error. If archived by accident, use reactivate to restore it.
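A minimal sketch of archiving and restoring a Dataset, assuming a registered Dataset named 'my-dataset' (placeholder):

    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()
    dataset = Dataset.get(ws, name='my-dataset')

    dataset.archive()      # consuming the Dataset now results in an error
    dataset.reactivate()   # restore it if it was archived by accident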

auto_read_files(path, include_path=False)

Analyzes the file(s) at the specified path and returns a new Dataset.

auto_read_files(path, include_path=False)

Parameters

path
DataReference or str

Data path in registered datastore or local path.

include_path
bool or optional

Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files and want to know which file a particular record originated from, or when the file path or name contains information that you want in a column.

default value: False

Returns

Dataset object.

Return type

Dataset

Remarks

Use this method when you'd like to have file formats and delimiters detected automatically.

After creating a Dataset, you should use get_profile to list detected column types and summary statistics for each column.

The returned Dataset is not registered with the workspace.
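A minimal sketch with automatic format detection; the local path is a placeholder:

    from azureml.core import Dataset

    # Detect the file format and delimiter automatically.
    dataset = Dataset.auto_read_files(path='./data/sales.csv')

    # Quick look at the parsed records.
    print(dataset.head(5))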

compare_profiles(rhs_dataset, profile_arguments={}, include_columns=None, exclude_columns=None, histogram_compare_method=<HistogramCompareMethod.WASSERSTEIN: 0>)

Compare the current Dataset's profile with rhs_dataset profile.

compare_profiles(rhs_dataset, profile_arguments={}, include_columns=None, exclude_columns=None, histogram_compare_method=<HistogramCompareMethod.WASSERSTEIN: 0>)

Parameters

rhs_dataset
Dataset

Another Dataset, also called the right-hand side Dataset, for comparison.

profile_arguments
Dict or optional

Arguments to retrieve a specific profile.

default value: {}
include_columns
List[str] or optional

List of column names to be included in comparison.

default value: None
exclude_columns
List[str] or optional

List of column names to be excluded in comparison.

default value: None
histogram_compare_method
HistogramCompareMethod or optional

Enum describing the comparison method, for example Wasserstein or Energy.

default value: HistogramCompareMethod.WASSERSTEIN

Returns

Difference of the profiles.

Return type

azureml.dataprep.api.typedefinitions.DataProfileDifference

Remarks

This is for registered Datasets only. An exception is raised if the current Dataset's profile does not exist. For unregistered Datasets, use the profile.compare method.
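A minimal sketch comparing the profiles of two registered Datasets, assuming both 'baseline' and 'latest' (placeholder names) have existing profiles:

    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()
    lhs = Dataset.get(ws, name='baseline')
    rhs = Dataset.get(ws, name='latest')

    # Restrict the comparison to the columns of interest.
    difference = lhs.compare_profiles(rhs_dataset=rhs, include_columns=['price'])
    print(difference)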

create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)

Create a snapshot of the registered Dataset.

create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)

Parameters

snapshot_name
str

The snapshot name. Snapshot names should be unique within a Dataset.

compute_target
ComputeTarget or str or optional

The compute target on which to perform the snapshot profile creation. If omitted, local compute is used.

default value: None
create_data_snapshot
bool or optional

If True, a materialized copy of the data will be created.

default value: False
target_datastore
AbstractAzureStorageDatastore or str or optional

Target datastore to save snapshot. If omitted, the snapshot will be created in the default storage of the workspace.

default value: None

Returns

Dataset snapshot object.

Return type

DatasetSnapshot

Remarks

Snapshots capture point-in-time summary statistics of the underlying data and an optional copy of the data itself. To learn more about creating snapshots, go to https://aka.ms/azureml/howto/createsnapshots.
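A minimal sketch of the snapshot lifecycle on a registered Dataset ('my-dataset' is a placeholder name):

    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()
    dataset = Dataset.get(ws, name='my-dataset')

    # Create a snapshot with a materialized copy of the data.
    snapshot = dataset.create_snapshot(snapshot_name='v1-snapshot',
                                       create_data_snapshot=True)

    # Retrieve it later by name, or list all snapshots.
    snapshot = dataset.get_snapshot('v1-snapshot')
    print(dataset.get_all_snapshots())

    # Free up storage once the snapshot is no longer needed.
    dataset.delete_snapshot('v1-snapshot')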

delete_snapshot(snapshot_name)

Delete snapshot of the Dataset by name.

delete_snapshot(snapshot_name)

Parameters

snapshot_name
str

The snapshot name.

Returns

None.

Return type

None

Remarks

Use this to free up storage consumed by data saved in snapshots that you no longer need.

deprecate(deprecate_by_dataset_id)

Deprecate the Dataset, with a pointer to the new Dataset.

deprecate(deprecate_by_dataset_id)

Parameters

deprecate_by_dataset_id
str

The ID of the Dataset that is the intended replacement for this Dataset.

Returns

None.

Return type

None

Remarks

Deprecated Datasets will log warnings when they are consumed. Deprecating a Dataset deprecates all of its definitions.

Deprecated Datasets can still be consumed. To completely block a Dataset from being consumed, archive it.

If deprecated by accident, use reactivate to restore it.
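A minimal sketch, assuming 'my-dataset' is being replaced by a newly registered 'my-dataset-v2' (both names are placeholders):

    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()
    old = Dataset.get(ws, name='my-dataset')
    new = Dataset.get(ws, name='my-dataset-v2')

    # Point consumers of the old Dataset at its replacement.
    old.deprecate(deprecate_by_dataset_id=new.id)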

diff(rhs_dataset, compute_target=None, columns=None)

Diff the current Dataset with rhs_dataset.

diff(rhs_dataset, compute_target=None, columns=None)

Parameters

rhs_dataset
Dataset

Another Dataset, also called the right-hand side Dataset, for comparison.

compute_target
ComputeTarget or str or optional

The compute target on which to perform the diff. If omitted, local compute is used.

default value: None
columns
List[str] or optional

List of column names to be included in diff.

default value: None

Returns

Dataset action run object.

Return type

DatasetActionRun

from_binary_files(path)

Create unregistered, in-memory Dataset from binary files.

from_binary_files(path)

Parameters

path
DataReference or str

Data path in registered datastore or local path.

Returns

Dataset object.

Return type

Dataset

Remarks

Use this method to read files as streams of binary data. Returns one file stream object per file read. Use this method when you're reading images, videos, audio or other binary data.

get_profile and create_snapshot will not work as expected for a Dataset created by this method.

The returned Dataset is not registered with the workspace.
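A minimal sketch reading image files as binary streams; the path is a placeholder:

    from azureml.core import Dataset

    # One file stream object is returned per file read.
    dataset = Dataset.from_binary_files(path='./images')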

from_delimited_files(path, separator=',', header=<PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS: 3>, encoding=<FileEncoding.UTF8: 0>, quoting=False, infer_column_types=True, skip_rows=0, skip_mode=<SkipLinesBehavior.NO_ROWS: 0>, comment=None, include_path=False, archive_options=None)

Create unregistered, in-memory Dataset from delimited files.

from_delimited_files(path, separator=',', header=<PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS: 3>, encoding=<FileEncoding.UTF8: 0>, quoting=False, infer_column_types=True, skip_rows=0, skip_mode=<SkipLinesBehavior.NO_ROWS: 0>, comment=None, include_path=False, archive_options=None)

Parameters

path
DataReference or str

Data path in registered datastore or local path.

separator
str

The separator used to split columns.

default value: ,
header

Controls how column headers are promoted when reading from files.

default value: PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS
encoding

The encoding of the files being read.

default value: FileEncoding.UTF8
quoting
bool or optional

Specify how to handle new line characters within quotes. The default (False) is to interpret new line characters as starting new rows, irrespective of whether the new line characters are within quotes or not. If set to True, new line characters inside quotes will not result in new rows, and file reading speed will slow down.

default value: False
infer_column_types
bool or optional

If true, column data types will be inferred.

default value: True
skip_rows
int or optional

How many rows to skip in the file(s) being read.

default value: 0
skip_mode

Controls how rows are skipped when reading from files.

default value: SkipLinesBehavior.NO_ROWS
comment
str or optional

Character used to indicate comment lines in the files being read. Lines beginning with this string will be skipped.

default value: None
include_path
bool or optional

Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files and want to know which file a particular record originated from, or when you want to keep useful information from the file path.

default value: False
archive_options
azureml.dataprep.api._archiveoption.ArchiveOptions

Options for reading an archive file, including the archive type and entry glob pattern. Only ZIP is supported as an archive type at the moment. For example, specifying


   archive_options = ArchiveOptions(archive_type=ArchiveType.ZIP, entry_glob='*10-20.csv')

reads all files with names ending in "10-20.csv" from the ZIP archive.

default value: None

Returns

Dataset object.

Return type

Dataset

Remarks

Use this method to read delimited text files when you want to control the options used.

After creating a Dataset, you should use get_profile to list detected column types and summary statistics for each column.

The returned Dataset is not registered with the workspace.
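A minimal sketch reading tab-separated files with explicit options; the path is a placeholder:

    from azureml.core import Dataset

    # Override the default comma separator for TSV input.
    dataset = Dataset.from_delimited_files(path='./data/logs.tsv',
                                           separator='\t')
    print(dataset.head(10))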

from_excel_files(path, sheet_name=None, use_column_headers=False, skip_rows=0, include_path=False, infer_column_types=True)

Create unregistered, in-memory Dataset from Excel files.

from_excel_files(path, sheet_name=None, use_column_headers=False, skip_rows=0, include_path=False, infer_column_types=True)

Parameters

path
DataReference or str

Data path in registered datastore or local path.

sheet_name
str or optional

The name of the Excel sheet to load. By default, the first sheet is read from each Excel file.

default value: None
use_column_headers
bool or optional

Controls whether to use the first row as column headers.

default value: False
skip_rows
int or optional

How many rows to skip in the file(s) being read.

default value: 0
include_path
bool or optional

Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files and want to know which file a particular record originated from, or when you want to keep useful information from the file path.

default value: False
infer_column_types
bool or optional

If true, column data types will be inferred.

default value: True

Returns

Dataset object.

Return type

Dataset

Remarks

Use this method to read Excel files in .xlsx format. Data can be read from one sheet in each Excel file. After creating a Dataset, you should use get_profile to list detected column types and summary statistics for each column. The returned Dataset is not registered with the workspace.
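A minimal sketch reading one sheet from an .xlsx file; the path and sheet name are placeholders:

    from azureml.core import Dataset

    # Use the first row of the 'Sales' sheet as column headers.
    dataset = Dataset.from_excel_files(path='./report.xlsx',
                                       sheet_name='Sales',
                                       use_column_headers=True)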

from_json_files(path, encoding=<FileEncoding.UTF8: 0>, flatten_nested_arrays=False, include_path=False)

Create unregistered, in-memory Dataset from json files.

from_json_files(path, encoding=<FileEncoding.UTF8: 0>, flatten_nested_arrays=False, include_path=False)

Parameters

path
DataReference or str

The path to the file(s) or folder(s) that you want to load and parse. It can be either a local path or an Azure Blob URL. Globbing is supported; for example, you can use path = "./data*" to read all files whose names start with "data".

encoding
azureml.dataprep.typedefinitions.FileEncoding

The encoding of the files being read.

default value: FileEncoding.UTF8
flatten_nested_arrays
bool or optional

Controls how nested arrays are handled. If you choose to flatten nested JSON arrays, the result can be a much larger number of rows.

default value: False
include_path
bool or optional

Whether to include a column containing the path from which the data was read. This is useful when you are reading multiple files and might want to know which file a particular record originated from, or when you want to keep useful information from the file path.

default value: False

Returns

Local Dataset object.

Return type

Dataset
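A minimal sketch using the glob support described above; the pattern is a placeholder:

    from azureml.core import Dataset

    # Read every JSON file whose name starts with "data".
    dataset = Dataset.from_json_files(path='./data*.json')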

from_pandas_dataframe(dataframe, path=None, in_memory=False)

Create unregistered, in-memory Dataset from pandas dataframe.

from_pandas_dataframe(dataframe, path=None, in_memory=False)

Parameters

dataframe
DataFrame

The Pandas DataFrame.

path
DataReference or str

Data path in registered datastore or local folder path.

default value: None
in_memory
bool or optional

Whether to read the DataFrame from memory instead of persisting to disk.

default value: False

Returns

Dataset object.

Return type

Dataset

Remarks

Use this method to convert a Pandas DataFrame to a Dataset. A Dataset created by this method cannot be registered, as the data comes from memory.

If in_memory is False, the Pandas DataFrame is converted to a CSV file locally. If path is a DataReference, the frame is uploaded to the datastore and the Dataset is created from that DataReference. If path is a local folder, the Dataset is created from the local file, which cannot be deleted.

Raises an exception if the given DataReference is not a folder path.
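A minimal sketch converting an in-memory DataFrame:

    import pandas as pd
    from azureml.core import Dataset

    df = pd.DataFrame({'id': [1, 2, 3], 'value': [10.0, 20.0, 30.0]})

    # in_memory=True keeps the data in memory; the resulting Dataset
    # cannot be registered.
    dataset = Dataset.from_pandas_dataframe(df, in_memory=True)
    print(dataset.head(3))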

from_parquet_files(path, include_path=False)

Create unregistered, in-memory Dataset from parquet files.

from_parquet_files(path, include_path=False)

Parameters

path
DataReference or str

Data path in registered datastore or local path.

include_path
bool or optional

Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files and want to know which file a particular record originated from, or when you want to keep useful information from the file path.

default value: False

Returns

Dataset object.

Return type

Dataset

Remarks

Use this method to read Parquet files.

After creating a Dataset, you should use get_profile to list detected column types and summary statistics for each column.

The returned Dataset is not registered with the workspace.

from_sql_query(data_source, query)

Create unregistered, in-memory Dataset from sql query.

from_sql_query(data_source, query)

Parameters

data_source
AzureSqlDatabaseDatastore

The details of the Azure SQL datastore.

query
str

The query to execute to read data.

Returns

Local Dataset object.

Return type

Dataset
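A minimal sketch, assuming an Azure SQL datastore named 'my_sql_datastore' (placeholder) is registered in the workspace:

    from azureml.core import Workspace, Datastore, Dataset

    ws = Workspace.from_config()
    sql_datastore = Datastore.get(ws, 'my_sql_datastore')

    dataset = Dataset.from_sql_query(data_source=sql_datastore,
                                     query='SELECT TOP 100 * FROM sales')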

generate_profile(compute_target=None, workspace=None, arguments=None)

Generate new profile for the Dataset.

generate_profile(compute_target=None, workspace=None, arguments=None)

Parameters

compute_target
ComputeTarget or str or optional

The compute target on which to perform the profile creation. If omitted, local compute is used.

default value: None
workspace
Workspace or optional

The workspace, required for transient (unregistered) Datasets.

default value: None
arguments
Dict[str, object] or optional

Profile arguments.

default value: None

Returns

Dataset action run object.

Return type

DatasetActionRun

Remarks

This is a synchronous call and will block until it completes. Call get_result to get the result of the action.
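A minimal sketch profiling a transient Dataset on local compute; the workspace argument is required because the Dataset is unregistered, and the path is a placeholder:

    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()
    dataset = Dataset.from_delimited_files(path='./data.csv')

    # Synchronous: blocks until the profile has been generated.
    action_run = dataset.generate_profile(workspace=ws)
    profile = action_run.get_result()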

get(workspace, name=None, id=None)

Get a Dataset that already exists in the workspace by specifying either its name or id.

get(workspace, name=None, id=None)

Parameters

workspace
Workspace

The existing AzureML workspace in which the Dataset was created.

name
str or optional

The name of the Dataset to be retrieved.

default value: None
id
str or optional

The unique identifier of the Dataset in the workspace.

default value: None

Returns

Dataset with the specified name or id.

Return type

Dataset

Remarks

You can provide either name or id. An exception is thrown if both are given but do not match, or if no Dataset with the specified name or id can be found in the workspace.
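A minimal sketch retrieving a registered Dataset by name or by ID ('my-dataset' is a placeholder):

    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()

    # Retrieve by name...
    dataset = Dataset.get(ws, name='my-dataset')

    # ...or by its unique ID.
    same_dataset = Dataset.get(ws, id=dataset.id)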

get_all_snapshots()

Get all snapshots of the Dataset.

get_all_snapshots()

Returns

List of Dataset snapshots.

Return type

List[DatasetSnapshot]

get_definition(version_id=None)

Get a specific definition of the Dataset.

get_definition(version_id=None)

Parameters

version_id
str or optional

The version_id of the Dataset definition.

default value: None

Returns

Dataset definition.

Return type

DatasetDefinition

Remarks

If version_id is provided, the definition corresponding to that version is returned; an exception is thrown if that version does not exist. If version_id is omitted, the latest version is retrieved.

get_definitions()

Get all the definitions of the Dataset.

get_definitions()

Returns

Dictionary of Dataset definitions.

Return type

dict

Remarks

A Dataset registered in an AzureML workspace can have multiple definitions, each created by calling update_definition. Each definition has a unique identifier. The current definition is the latest one created.

For unregistered Datasets, only one definition exists.

get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None)

Get summary statistics on the Dataset computed earlier.

get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None)

Parameters

arguments
Dict[str, object] or optional

Profile arguments.

default value: None
generate_if_not_exist
bool or optional

Whether to generate a profile if one does not exist.

default value: True
workspace
Workspace or optional

The workspace, required for transient (unregistered) Datasets.

default value: None
compute_target
ComputeTarget or str or optional

The compute target on which to execute the profile action.

default value: None

Returns

DataProfile of the Dataset.

Return type

DataProfile

Remarks

For a Dataset registered with an AML workspace, this method retrieves an existing profile that was created earlier by calling get_profile, if it is still valid. Profiles are invalidated when changed data is detected in the Dataset or when the arguments to get_profile differ from those used when the profile was generated. If the profile is not present or has been invalidated, generate_if_not_exist determines whether a new profile is generated.

For a Dataset that is not registered with an AML workspace, this method always runs generate_profile and returns the result.
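A minimal sketch for a registered Dataset ('my-dataset' is a placeholder); the cached profile is returned if still valid, otherwise a new one is generated:

    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()
    dataset = Dataset.get(ws, name='my-dataset')

    profile = dataset.get_profile()
    print(profile)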

get_snapshot(snapshot_name)

Get snapshot of the Dataset by name.

get_snapshot(snapshot_name)

Parameters

snapshot_name
str

The snapshot name.

Returns

Dataset snapshot object.

Return type

DatasetSnapshot

head(count)

Pull the specified number of records from this Dataset and return them as a DataFrame.

head(count)

Parameters

count
int

The number of records to pull.

Returns

A Pandas DataFrame.

Return type

pandas.core.frame.DataFrame

list(workspace)

List all of the Datasets in the workspace, including ones with is_visible=False.

list(workspace)

Parameters

workspace
Workspace

The workspace for which you want to retrieve the list of Datasets.

Returns

List of Dataset objects.

Return type

List[Dataset]

reactivate()

Reactivate the Dataset. Works on Datasets that have been deprecated or archived.

reactivate()

Returns

None.

Return type

None

register(workspace, name, description=None, tags=None, visible=True, exist_ok=False, update_if_exist=False)

Register the Dataset in the workspace, making it available to other users of the workspace.

register(workspace, name, description=None, tags=None, visible=True, exist_ok=False, update_if_exist=False)

Parameters

workspace
Workspace

The AzureML workspace in which the Dataset is to be registered.

name
str

The name of the Dataset in the workspace.

description
str or optional

Description of the Dataset.

default value: None
tags
dict[str, str] or optional

Tags to associate with the Dataset.

default value: None
visible
bool or optional

Controls the visibility of the Dataset to the user in the UI. If False, the Dataset is hidden in the UI but still available via the SDK.

default value: True
exist_ok
bool or optional

If True, the method returns the Dataset if it already exists in the given workspace; otherwise it raises an error.

default value: False
update_if_exist
bool or optional

If exist_ok is True and update_if_exist is True, this method will update the definition and return the updated Dataset.

default value: False

Returns

Registered Dataset object in the workspace.

Return type

Dataset
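A minimal sketch registering an unregistered Dataset, updating the definition if the name is already taken; the name and path are placeholders:

    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()
    dataset = Dataset.from_delimited_files(path='./data.csv')

    dataset = dataset.register(workspace=ws,
                               name='my-dataset',
                               description='Daily sales data',
                               tags={'source': 'csv'},
                               exist_ok=True,
                               update_if_exist=True)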

sample(sample_strategy, arguments)

Generate a new sample from the source Dataset, using the sampling strategy and parameters provided.

sample(sample_strategy, arguments)

Parameters

sample_strategy
str

Sample strategy to use: top_n, simple_random or stratified.

arguments

Sample arguments.

Returns

Sample Dataset object.

Return type

Dataset

Remarks

Samples are generated by executing the transformation pipeline defined by this Dataset, and then applying the sampling strategy and parameters to the output data.
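A minimal sketch of simple random sampling; the argument keys 'probability' and 'seed' are assumptions based on the strategy name and are not confirmed by this page:

    from azureml.core import Dataset

    dataset = Dataset.from_delimited_files(path='./data.csv')  # placeholder path

    # Keep roughly 30% of the rows.
    sampled = dataset.sample(sample_strategy='simple_random',
                             arguments={'probability': 0.3, 'seed': 10})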

to_pandas_dataframe()

Create a Pandas dataframe by executing the transformation pipeline defined by this Dataset definition.

to_pandas_dataframe()

Returns

A Pandas DataFrame.

Return type

pandas.core.frame.DataFrame

Remarks

The returned Pandas DataFrame is fully materialized in memory.
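A minimal sketch; the path is a placeholder:

    from azureml.core import Dataset

    dataset = Dataset.from_delimited_files(path='./data.csv')

    # Executes the transformation pipeline and materializes the result in memory.
    df = dataset.to_pandas_dataframe()
    print(df.shape)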

to_spark_dataframe()

Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataset definition.

to_spark_dataframe()

Returns

A Spark DataFrame.

Return type

pyspark.sql.DataFrame

Remarks

The Spark DataFrame returned is only an execution plan and does not actually contain any data, since Spark DataFrames are lazily evaluated.
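A minimal sketch, assuming a Spark environment is available; the path is a placeholder:

    from azureml.core import Dataset

    dataset = Dataset.from_delimited_files(path='./data.csv')

    # Only an execution plan is returned; no data is read until an action runs.
    spark_df = dataset.to_spark_dataframe()
    spark_df.show(5)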

update(name=None, description=None, tags=None, visible=None)

Update the Dataset mutable attributes in the workspace and return the updated Dataset from the workspace.

update(name=None, description=None, tags=None, visible=None)

Parameters

name
str or optional

The name of the Dataset in the workspace.

default value: None
description
str or optional

Description of the data.

default value: None
tags
dict[str, str] or optional

Tags to associate the Dataset with.

default value: None
visible
bool or optional

Visibility of the Dataset to the user in the UI.

default value: None

Returns

Updated Dataset object from the workspace.

Return type

Dataset

update_definition(definition, definition_update_message)

Update the Dataset definition.

update_definition(definition, definition_update_message)

Parameters

definition
azureml.data.DatasetDefinition

The new definition of this Dataset.

definition_update_message
str

Definition update message.

Returns

Updated Dataset object from the workspace.

Return type

Dataset

Remarks

To consume the updated Dataset, use the object returned by this method.
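A minimal sketch pointing a registered Dataset at a new source file; the name and path are placeholders:

    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()
    dataset = Dataset.get(ws, name='my-dataset')

    # Build a new definition from the updated source file.
    new_definition = Dataset.from_delimited_files(path='./data_v2.csv').get_definition()

    # Consume the object returned here to pick up the new definition.
    dataset = dataset.update_definition(new_definition, 'Switched to v2 source file')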

Attributes

definition

Return the current Dataset definition.

Returns

Dataset definition.

Return type

DatasetDefinition

Remarks

A Dataset definition is a series of steps that specify how to read and transform data.

A Dataset registered in an AzureML workspace can have multiple definitions, each created by calling update_definition. Each definition has a unique identifier. Having multiple definitions allows you to make changes to existing Datasets without breaking models and pipelines that depend on the older definition.

For unregistered Datasets, only one definition exists.

definition_version

Return the version of the current definition of the Dataset.

Returns

Dataset definition version.

Return type

str

Remarks

A Dataset definition is a series of steps that specify how to read and transform data.

A Dataset registered in an AzureML workspace can have multiple definitions, each created by calling update_definition. Each definition has a unique identifier. The current definition is the latest one created, and its ID is returned by this property.

For unregistered Datasets, only one definition exists.

description

Return the description of the Dataset.

Returns

Dataset description.

Return type

str

Remarks

A description of the data in the Dataset. Filling it in allows users of the workspace to understand what the data represents and how they can use it.

id

If the Dataset was registered in an AzureML workspace, return the ID of the Dataset. Otherwise, return None.

Returns

Dataset id.

Return type

str

is_visible

Control the visibility of a registered Dataset in the Azure ML workspace UI.

Returns

Dataset visibility.

Return type

bool

Remarks

Has no effect on unregistered Datasets.

name

Return the Dataset name.

Returns

Dataset name.

Return type

str

state

Return the state of the Dataset.

Returns

Dataset state.

Return type

str

tags

Return the tags associated with the Dataset.

Returns

Dataset tags.

Return type

dict[str, str]

workspace

If the Dataset was registered in an AzureML workspace, return the workspace. Otherwise, return None.

Returns

The workspace.

Return type

Workspace