DatasetDefinition class

Definition

Defines a series of steps that specify how to read and transform data in a Dataset.

A Dataset registered in an Azure Machine Learning workspace can have multiple definitions, each created by calling update_definition(definition, definition_update_message). Each definition has a unique identifier. The current definition is the latest one created.

For unregistered Datasets, only one definition exists.

Dataset definitions support all the transformations listed for the Dataflow class; see http://aka.ms/azureml/howto/transformdata. To learn more about dataset definitions, go to https://aka.ms/azureml/howto/versiondata.
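The versioning behavior described above can be sketched as follows. This is a minimal illustration, not the SDK's implementation: publish_definition is a hypothetical helper, its rule that a message must be non-empty is an assumption of this sketch, and _StubDataset is a stand-in so the example runs without a workspace. Only the update_definition(definition, definition_update_message) signature comes from the text above.

```python
class _StubDataset:
    """Stand-in for a registered Dataset (hypothetical, for illustration only)."""

    def __init__(self):
        # Oldest first; the last entry plays the role of the current definition.
        self.definitions = []

    def update_definition(self, definition, definition_update_message):
        # Mirrors the documented signature; the real method talks to the workspace.
        self.definitions.append((definition, definition_update_message))


def publish_definition(dataset, definition, message):
    """Register `definition` as the newest definition of `dataset`.

    Requiring a non-empty message is a policy of this sketch, not of the SDK.
    """
    if not message or not message.strip():
        raise ValueError("an update message is required to explain the change")
    dataset.update_definition(definition, message.strip())


dataset = _StubDataset()
publish_definition(dataset, "definition-v1", "initial definition")
publish_definition(dataset, "definition-v2", "drop unused columns")

# The latest definition created is the current one.
current_definition, current_message = dataset.definitions[-1]
```

With a real registered Dataset, the same helper would be passed the Dataset object and a DatasetDefinition instead of the stub and the placeholder strings.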

DatasetDefinition(workspace=None, dataset_id=None, version_id=None, dataflow=None, dataflow_json=None, notes=None, etag=None, created_time=None, modified_time=None, state=None, deprecated_by_dataset_id=None, deprecated_by_definition_version=None, data_path=None, dataset=None, file_type='Unknown')

Inheritance
builtins.object
azureml.dataprep.api.dataflow.Dataflow
DatasetDefinition

Methods

archive()

Archive the dataset definition.

create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)

Create a snapshot of the registered Dataset.

deprecate(deprecate_by_dataset_id, deprecated_by_definition_version=None)

Deprecate the Dataset, with a pointer to the new Dataset.

reactivate()

Reactivate the dataset definition.

Works on dataset definitions that have been deprecated or archived.

to_pandas_dataframe()

Create a Pandas DataFrame by executing the transformation pipeline defined by this dataset definition.

to_spark_dataframe()

Create a Spark DataFrame that can execute the transformation pipeline defined by this dataset definition.

archive()

Archive the dataset definition.

Returns

None.

Return type

None

Remarks

After archival, any attempt to retrieve the dataset results in an error. If the definition was archived by accident, use reactivate() to reactivate it.
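The archive and reactivate round trip can be sketched as below. This is an illustration under stated assumptions: _StubDefinition stands in for a real DatasetDefinition so the example runs without a workspace, the state names "active" and "archived" are assumed rather than taken from this page, and archive_unless_vetoed is a hypothetical helper. Only archive() and reactivate() are documented methods.

```python
class _StubDefinition:
    """Stand-in for a DatasetDefinition (hypothetical; state names are assumed)."""

    def __init__(self):
        self.state = "active"

    def archive(self):
        self.state = "archived"

    def reactivate(self):
        self.state = "active"


def archive_unless_vetoed(definition, confirm):
    """Archive `definition`; undo with reactivate() if `confirm()` returns False.

    Returns True when the definition stays archived, False when it was restored.
    """
    definition.archive()
    if confirm():
        return True
    # The archival was a mistake: reactivate() restores the definition.
    definition.reactivate()
    return False


kept = _StubDefinition()
kept_result = archive_unless_vetoed(kept, confirm=lambda: True)

restored = _StubDefinition()
restored_result = archive_unless_vetoed(restored, confirm=lambda: False)
```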

create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)

Create a snapshot of the registered Dataset.

Parameters

snapshot_name
str

The snapshot name. Snapshot names should be unique within a Dataset.

compute_target
ComputeTarget or str, optional

The compute target on which to create the snapshot profile. If omitted, local compute is used.

default value: None

create_data_snapshot
bool, optional

If True, a materialized copy of the data will be created.

default value: False

target_datastore
AbstractAzureStorageDatastore or str, optional

The target datastore in which to save the snapshot. If omitted, the snapshot is created in the default storage of the workspace.

default value: None

Returns

A DatasetSnapshot object.

Return type

DatasetSnapshot

Remarks

Snapshots capture point-in-time summary statistics of the underlying data and, optionally, a copy of the data itself. To learn more about creating snapshots, go to https://aka.ms/azureml/howto/createsnapshots.
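A call that requests both the summary statistics and a copy of the data might look like the sketch below. The keyword arguments match the create_snapshot signature documented above. Everything else is an assumption of this example: unique_snapshot_name and snapshot_with_data are hypothetical helpers, the timestamp suffix is just one way to keep names unique within a Dataset (it only distinguishes calls at least one second apart), and _RecordingDefinition is a stand-in so the example runs without a workspace.

```python
from datetime import datetime, timezone


def unique_snapshot_name(prefix, now=None):
    """Build a snapshot name from `prefix` plus a UTC timestamp.

    `now` should be a timezone-aware UTC datetime; it defaults to the current time.
    """
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%dT%H%M%SZ')}"


def snapshot_with_data(definition, prefix, target_datastore=None, now=None):
    """Create a snapshot that also materializes a copy of the data."""
    return definition.create_snapshot(
        snapshot_name=unique_snapshot_name(prefix, now),
        compute_target=None,           # omitted: local compute is used
        create_data_snapshot=True,     # also keep a materialized copy of the data
        target_datastore=target_datastore,  # None: workspace default storage
    )


class _RecordingDefinition:
    """Stand-in for a DatasetDefinition (hypothetical); echoes the arguments."""

    def create_snapshot(self, snapshot_name, compute_target=None,
                        create_data_snapshot=False, target_datastore=None):
        return {
            "snapshot_name": snapshot_name,
            "compute_target": compute_target,
            "create_data_snapshot": create_data_snapshot,
            "target_datastore": target_datastore,
        }


fixed_time = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
result = snapshot_with_data(_RecordingDefinition(), "sales", now=fixed_time)
```

The real method returns a DatasetSnapshot object rather than the dictionary the stand-in produces.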

deprecate(deprecate_by_dataset_id, deprecated_by_definition_version=None)

Deprecate the Dataset, with a pointer to the new Dataset.

Parameters

deprecate_by_dataset_id
uuid

The ID of the dataset that is responsible for the deprecation of the current dataset.

deprecated_by_definition_version
str, optional

The version of the dataset definition that is responsible for the deprecation of the current dataset definition.

default value: None

Returns

None.

Return type

None

Remarks

Deprecated dataset definitions will log warnings when they are consumed. To completely block a dataset definition from being consumed, archive it.

If a dataset definition is deprecated by accident, use reactivate() to activate it.

reactivate()

Reactivate the dataset definition.

Works on dataset definitions that have been deprecated or archived.

Returns

None.

Return type

None

to_pandas_dataframe()

Create a Pandas DataFrame by executing the transformation pipeline defined by this dataset definition.

Returns

A Pandas DataFrame.

Return type

pandas.DataFrame

Remarks

The returned Pandas DataFrame is fully materialized in memory.

to_spark_dataframe()

Create a Spark DataFrame that can execute the transformation pipeline defined by this dataset definition.

Returns

A Spark DataFrame.

Return type

pyspark.sql.DataFrame

Remarks

The returned Spark DataFrame is only an execution plan and does not contain any data, because Spark DataFrames are lazily evaluated.
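The difference between the two conversion methods suggests choosing one explicitly at the call site. The sketch below is one possible way to do that, not part of the SDK: to_dataframe and its engine parameter are hypothetical, and _StubDefinition returns placeholder strings so the example runs without pandas, Spark, or a workspace. Only to_pandas_dataframe() and to_spark_dataframe() are documented methods.

```python
def to_dataframe(definition, engine="pandas"):
    """Convert `definition` using the requested engine.

    "pandas" executes the pipeline now and holds the full result in memory.
    "spark" returns a lazily evaluated plan; no data is read until Spark
    runs an action on it.
    """
    if engine == "pandas":
        return definition.to_pandas_dataframe()
    if engine == "spark":
        return definition.to_spark_dataframe()
    raise ValueError(f"unknown engine {engine!r}; expected 'pandas' or 'spark'")


class _StubDefinition:
    """Stand-in for a DatasetDefinition (hypothetical); returns placeholders."""

    def to_pandas_dataframe(self):
        return "materialized pandas frame"

    def to_spark_dataframe(self):
        return "lazy spark plan"


definition = _StubDefinition()
eager_result = to_dataframe(definition)
lazy_result = to_dataframe(definition, engine="spark")
```

As a rule of thumb, the pandas path suits data that fits in local memory, while the Spark path suits pipelines that continue in Spark.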