DatasetDefinition Class

Defines a series of steps that specify how to read and transform data in a Dataset.

Note

This class is deprecated. For more information, see https://aka.ms/dataset-deprecation.

A Dataset registered in an Azure Machine Learning workspace can have multiple definitions, each created by calling update_definition. Each definition has a unique identifier. The current definition is the most recently created one.

For unregistered Datasets, only one definition exists.
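The relationship between a registered Dataset and its definitions can be pictured with a small, purely conceptual model. This is a sketch under stated assumptions, not the SDK's implementation: the classes DefinitionRecord and RegisteredDatasetModel are hypothetical, the simplified update_definition(notes) signature is an assumption, and the numeric version strings are chosen only for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class DefinitionRecord:
    """Hypothetical stand-in for one Dataset definition."""
    version_id: str
    notes: str
    # Each definition carries its own unique identifier.
    definition_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class RegisteredDatasetModel:
    """Hypothetical model of a registered Dataset's definition history."""

    def __init__(self) -> None:
        # Assumption: versions are numbered "1", "2", ... for illustration.
        self._definitions: List[DefinitionRecord] = [
            DefinitionRecord("1", "initial definition")
        ]

    def update_definition(self, notes: str) -> DefinitionRecord:
        # Each call adds a new definition; earlier ones are kept.
        record = DefinitionRecord(str(len(self._definitions) + 1), notes)
        self._definitions.append(record)
        return record

    @property
    def current_definition(self) -> DefinitionRecord:
        # The current definition is the latest one created.
        return self._definitions[-1]

    @property
    def definitions(self) -> List[DefinitionRecord]:
        return list(self._definitions)


dataset = RegisteredDatasetModel()
dataset.update_definition("drop empty rows")
latest = dataset.update_definition("rename columns")
print(dataset.current_definition.version_id, len(dataset.definitions))
```

In this model every call to update_definition leaves the older definitions in place, which is what makes it possible to refer to a specific definition version later.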

Dataset definitions support all the transformations listed for the azureml.dataprep.Dataflow class; see http://aka.ms/azureml/howto/transformdata. To learn more about Dataset definitions, go to https://aka.ms/azureml/howto/versiondata.

Initialize the Dataset definition object.

Inheritance: azureml.dataprep.api.engineless_dataflow.EnginelessDataflow

DatasetDefinition

Constructor

DatasetDefinition(workspace=None, dataset_id=None, version_id=None, dataflow=None, dataflow_json=None, notes=None, etag=None, created_time=None, modified_time=None, state=None, deprecated_by_dataset_id=None, deprecated_by_definition_version=None, data_path=None, dataset=None, file_type='Unknown')

Parameters

workspace: str

Required

The workspace the Dataset is registered in.

dataset_id: str

Required

The Dataset identifier.

version_id: str

Required

The definition version.

dataflow: str

Required

The Dataflow object.

dataflow_json

Required

The Dataflow JSON.

notes: str

default value: None

Optional information about the definition.

etag: str

Required

Etag.

created_time: datetime

Required

The creation time of the definition.

modified_time: datetime

Required

The last modified time of the definition.

deprecated_by_dataset_id: str

Required

The ID of the Dataset that deprecates this definition.

deprecated_by_definition_version: str

Required

The version of the definition that deprecates this definition.

data_path: DataPath

Required

The data path.

dataset: Dataset

Required

The parent Dataset object.

Methods

archive	Archive the dataset definition.
create_snapshot	Create a snapshot of the registered Dataset.
deprecate	Deprecate the Dataset, with a pointer to the new Dataset.
reactivate	Reactivate the dataset definition. Works on dataset definitions that have been deprecated or archived.
to_pandas_dataframe	Create a Pandas dataframe by executing the transformation pipeline defined by this dataset definition.
to_spark_dataframe	Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataflow.

create_snapshot

Create a snapshot of the registered Dataset.

create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)

Parameters

snapshot_name: str

Required

The snapshot name. Snapshot names should be unique within a Dataset.

compute_target: ComputeTarget or str

default value: None

The compute target on which to create the snapshot profile. If omitted, local compute is used.

create_data_snapshot: bool

default value: False

If True, a materialized copy of the data will be created.

target_datastore: Union[AbstractAzureStorageDatastore, str]

default value: None

The datastore in which to save the snapshot. If omitted, the snapshot is created in the default storage of the workspace.

Returns

A DatasetSnapshot object.

Return type

DatasetSnapshot

Remarks

Snapshots capture point-in-time summary statistics of the underlying data and, optionally, a copy of the data itself. To learn more about creating snapshots, go to https://aka.ms/azureml/howto/createsnapshots.
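A call to create_snapshot might be wrapped as in the following sketch, which also enforces the unique-name recommendation on the client side. The names StubDefinition, snapshot_with_unique_name, existing_names, and the example snapshot names are hypothetical; the stub only records its arguments so that the sketch runs without a workspace, and it mirrors the signature shown above rather than any real service behavior.

```python
class StubDefinition:
    """Hypothetical stand-in for a DatasetDefinition; records calls only."""

    def __init__(self):
        self.calls = []

    def create_snapshot(self, snapshot_name, compute_target=None,
                        create_data_snapshot=False, target_datastore=None):
        # Record the arguments instead of contacting Azure Machine Learning.
        call = {
            "snapshot_name": snapshot_name,
            "compute_target": compute_target,
            "create_data_snapshot": create_data_snapshot,
            "target_datastore": target_datastore,
        }
        self.calls.append(call)
        return call


def snapshot_with_unique_name(definition, snapshot_name, existing_names,
                              materialize=False):
    """Create a snapshot, refusing names already used within the Dataset.

    existing_names is supplied by the caller (an assumption of this sketch).
    """
    if snapshot_name in existing_names:
        raise ValueError(f"snapshot name {snapshot_name!r} is already in use")
    return definition.create_snapshot(
        snapshot_name=snapshot_name,
        # True also stores a materialized copy of the data.
        create_data_snapshot=materialize,
        # compute_target and target_datastore keep their defaults:
        # local compute and the workspace's default storage.
    )


definition = StubDefinition()
result = snapshot_with_unique_name(
    definition, "2019-06-baseline", existing_names={"2019-05-baseline"}
)
```

With a real definition object in place of the stub, the same keyword arguments would apply; leaving compute_target and target_datastore unset relies on the defaults described in the parameter list.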

deprecate

Deprecate the Dataset, with a pointer to the new Dataset.

deprecate(deprecate_by_dataset_id, deprecated_by_definition_version=None)

Parameters

deprecate_by_dataset_id: uuid

Required

The ID of the dataset that deprecates the current dataset.

deprecated_by_definition_version: str

default value: None

The version of the dataset definition that deprecates the current dataset definition.

Returns

None.

Return type

None

Remarks

Deprecated dataset definitions will log warnings when they are consumed. To completely block a dataset definition from being consumed, archive it.

If a dataset definition is deprecated by accident, use reactivate to make it active again.
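The difference between deprecating, archiving, and reactivating a definition can be sketched as a small state model. This is a conceptual illustration only: DefinitionLifecycleModel, its consume method, the state names, and the placeholder ID are hypothetical, and the real class may track state differently.

```python
import warnings


class DefinitionLifecycleModel:
    """Hypothetical model of the states described above; not the SDK class."""

    def __init__(self):
        self.state = "active"
        self.deprecated_by_dataset_id = None
        self.deprecated_by_definition_version = None

    def deprecate(self, deprecate_by_dataset_id,
                  deprecated_by_definition_version=None):
        # Keep a pointer to the dataset (and optional version) that replaces this one.
        self.state = "deprecated"
        self.deprecated_by_dataset_id = deprecate_by_dataset_id
        self.deprecated_by_definition_version = deprecated_by_definition_version

    def archive(self):
        self.state = "archived"

    def reactivate(self):
        # Works on definitions that have been deprecated or archived.
        self.state = "active"

    def consume(self):
        # Hypothetical method standing in for any use of the definition.
        if self.state == "archived":
            raise RuntimeError("archived definitions cannot be consumed")
        if self.state == "deprecated":
            warnings.warn(
                f"definition is deprecated by dataset {self.deprecated_by_dataset_id}",
                DeprecationWarning,
            )
        return "data"


definition = DefinitionLifecycleModel()
definition.deprecate("hypothetical-replacement-id")

# A deprecated definition can still be consumed, but a warning is logged.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    definition.consume()

# Archiving blocks consumption entirely; reactivate undoes either state.
definition.archive()
definition.reactivate()
```

The design point is that deprecation is advisory while archiving is a hard stop, so deprecation suits a migration period in which consumers are still moving to the replacement dataset.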

reactivate

Reactivate the dataset definition.

Works on dataset definitions that have been deprecated or archived.

reactivate()

Returns

None.

Return type

None

to_pandas_dataframe

Create a Pandas dataframe by executing the transformation pipeline defined by this dataset definition.

to_pandas_dataframe()

Returns

A Pandas DataFrame.

Return type

DataFrame

Remarks

Returns a pandas DataFrame that is fully materialized in memory.

to_spark_dataframe

Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataflow.

to_spark_dataframe()

Returns

A Spark DataFrame.

Return type

DataFrame

Remarks

The Spark DataFrame returned is only an execution plan and does not contain any data, because Spark DataFrames are lazily evaluated.
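Lazy evaluation can be illustrated without Spark by a plain Python analogy. The LazyPlan class below is hypothetical and is not how Spark is implemented; it only shows that building a plan touches no rows until results are requested, in contrast to a fully materialized result.

```python
class LazyPlan:
    """Hypothetical analogy for a lazily evaluated DataFrame; this is not Spark."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def map(self, fn):
        # Adding a step only extends the plan; no row is processed yet.
        return LazyPlan(self._rows, self._steps + (fn,))

    def collect(self):
        # Execution happens only when results are requested.
        out = []
        for row in self._rows:
            for step in self._steps:
                row = step(row)
            out.append(row)
        return out


executed = []


def double(x):
    # Track which rows have actually been processed.
    executed.append(x)
    return x * 2


plan = LazyPlan([1, 2, 3]).map(double)
ran_before_collect = len(executed)  # still 0: the plan has not run
result = plan.collect()             # now every row is processed
print(ran_before_collect, result)
```

By the same reasoning, obtaining the Spark DataFrame is cheap, and the cost of reading and transforming the data is paid only when an action is run on it.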
