DatasetSnapshot Class

Manages Dataset snapshots with operations to get a snapshot, return its status, and convert it to a dataframe.

Note

This class is deprecated. For more information, see https://aka.ms/dataset-deprecation.

A DatasetSnapshot object is returned from the create_snapshot method of the Dataset class.

A Dataset snapshot is a combination of a profile and an optional materialized copy of the data.

To learn more about Dataset Snapshots, go to https://aka.ms/azureml/howto/createsnapshots

Inheritance
builtins.object
DatasetSnapshot

Constructor

DatasetSnapshot(workspace, snapshot_name, dataset_id, definition_version=None, time_stamp=None, profile_action_id=None, datastore_name=None, relative_path=None, dataset_name=None)

Parameters

workspace
<xref:azureml.core.Workspace>
Required

The workspace the Dataset is registered in.

snapshot_name
str
Required

The name of the Dataset snapshot.

dataset_id
str
Required

The identifier of the Dataset.

definition_version
str
default value: None

The definition version of the Dataset.

time_stamp
datetime
default value: None

The snapshot creation time.

profile_action_id
str
default value: None

The snapshot profile action ID.

datastore_name
str
default value: None

The snapshot data store name.

relative_path
str
default value: None

The relative path to the snapshot data.

dataset_name
str
default value: None

The name of the Dataset.

Methods

compare_profiles

Compare the profile of the current Dataset snapshot with the profile of rhs_dataset_snapshot.

This method raises an exception if either profile does not exist.

get

Get the snapshot of Dataset by snapshot name.

get_all

Get all the snapshots of the given Dataset.

get_profile

Get the profile of the Dataset snapshot.

get_status

Get the Dataset snapshot creation status.

is_data_snapshot_available

Check if the materialized copy of the snapshot is available.

to_pandas_dataframe

Create a Pandas DataFrame by loading the data saved with the snapshot.

to_spark_dataframe

Create a Spark DataFrame by loading the data saved with the snapshot.

wait_for_completion

Wait for the completion of DatasetSnapshot generation.

compare_profiles

Compare the profile of the current Dataset snapshot with the profile of rhs_dataset_snapshot.

This method raises an exception if either profile does not exist.

compare_profiles(rhs_dataset_snapshot, include_columns=None, exclude_columns=None, histogram_compare_method=HistogramCompareMethod.WASSERSTEIN)

Parameters

rhs_dataset_snapshot
DatasetSnapshot
Required

The Dataset snapshot to compare with.

include_columns
list[str]
default value: None

A list of column names to be included in the comparison.

exclude_columns
list[str]
default value: None

A list of column names to be excluded from the comparison.

histogram_compare_method
HistogramCompareMethod
default value: HistogramCompareMethod.WASSERSTEIN

An enum describing the comparison method, for example: WASSERSTEIN or ENERGY.

Returns

The difference between the profiles.

Return type

<xref:azureml.dataprep.api.engineapi.typedefinitions.DataProfileDifference>
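Because compare_profiles raises when a profile is missing, callers may want to wrap the call. The following is a minimal sketch, not part of the SDK: the helper name compare_snapshot_profiles and its columns argument are hypothetical, both arguments are assumed to be DatasetSnapshot objects, and the exception type is caught broadly because this reference does not name it.

```python
def compare_snapshot_profiles(lhs_snapshot, rhs_snapshot, columns=None):
    """Return the profile difference between two snapshots, or None on failure.

    Hypothetical helper: lhs_snapshot and rhs_snapshot are assumed to be
    DatasetSnapshot objects; columns is an optional list of column names.
    """
    try:
        # histogram_compare_method is left at its documented default.
        return lhs_snapshot.compare_profiles(
            rhs_snapshot,
            include_columns=columns,
        )
    except Exception as error:  # the exact exception type is not documented here
        print(f"Profile comparison failed: {error}")
        return None
```

Returning None on failure is only one possible design; re-raising the exception may suit pipelines that should stop when a profile is missing.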

get

Get the snapshot of Dataset by snapshot name.

static get(workspace, snapshot_name, dataset_name=None, dataset_id=None)

Parameters

workspace
Workspace
Required

The workspace the Dataset is registered in.

snapshot_name
str
Required

The name of the Dataset snapshot.

dataset_name
str
default value: None

The name of the Dataset.

dataset_id
uuid
default value: None

The identifier of the Dataset.

Returns

A DatasetSnapshot object.

Return type

DatasetSnapshot

get_all

Get all the snapshots of the given Dataset.

static get_all(workspace, dataset_name)

Parameters

workspace
Workspace
Required

The workspace the Dataset is registered in.

dataset_name
str
Required

The name of the Dataset.

Returns

A list of Dataset snapshots.

Return type

list[DatasetSnapshot]
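As an illustration of how get_all might be used, the sketch below collects the names of every snapshot of a Dataset. The helper list_snapshot_names is hypothetical; snapshot_cls stands in for the DatasetSnapshot class so that the sketch runs without the SDK installed, and it relies only on the static get_all method and the name attribute described in this reference.

```python
def list_snapshot_names(snapshot_cls, workspace, dataset_name):
    """Return the names of all snapshots of the given Dataset.

    Hypothetical helper: snapshot_cls is assumed to be the DatasetSnapshot
    class (or any object exposing the same static get_all method).
    """
    snapshots = snapshot_cls.get_all(workspace, dataset_name)
    # Each returned item is expected to expose the documented name attribute.
    return [snapshot.name for snapshot in snapshots]
```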

get_profile

Get the profile of the Dataset snapshot.

get_profile()

Returns

The DataProfile of the Dataset snapshot.

Return type

<xref:azureml.dataprep.DataProfile>

get_status

Get the Dataset snapshot creation status.

get_status()

Returns

The status of the Dataset snapshot.

Return type

str

is_data_snapshot_available

Check if the materialized copy of the snapshot is available.

is_data_snapshot_available()

Returns

True if the data snapshot is available.

Return type

bool

to_pandas_dataframe

Create a Pandas DataFrame by loading the data saved with the snapshot.

to_pandas_dataframe()

Returns

A Pandas DataFrame.

Return type

pandas.DataFrame

Remarks

The Pandas DataFrame is fully materialized in memory. If the snapshot was created with create_data_snapshot=False, an exception is thrown. To check if the snapshot contains data, use is_data_snapshot_available.
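The availability check described in these remarks can be sketched as a small guard. The helper name snapshot_to_pandas is hypothetical, and snapshot is assumed to be a DatasetSnapshot object; only the two documented methods are called.

```python
def snapshot_to_pandas(snapshot):
    """Load the snapshot data as a Pandas DataFrame, or return None.

    Hypothetical helper: returns None when the snapshot has no materialized
    copy of the data, instead of letting to_pandas_dataframe raise.
    """
    if not snapshot.is_data_snapshot_available():
        return None
    # The whole DataFrame is materialized in memory at this point.
    return snapshot.to_pandas_dataframe()
```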

to_spark_dataframe

Create a Spark DataFrame by loading the data saved with the snapshot.

to_spark_dataframe()

Returns

A Spark DataFrame.

Return type

pyspark.sql.DataFrame

Remarks

The returned Spark DataFrame is only an execution plan and does not contain any data, because Spark DataFrames are lazily evaluated. If the snapshot was created with create_data_snapshot=False, an exception is thrown when you try to access the data. To check if the snapshot contains data, use is_data_snapshot_available.

wait_for_completion

Wait for the completion of DatasetSnapshot generation.

wait_for_completion(show_output=True, status_update_frequency=10)

Parameters

show_output
bool
default value: True

Indicates whether the method prints output.

status_update_frequency
int
default value: 10

The Action run status update frequency in seconds.
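A typical pattern is to block until generation finishes and then read the final status. The following is a minimal sketch under that assumption: the helper wait_for_snapshot and its poll_seconds argument are hypothetical, snapshot is assumed to be a DatasetSnapshot object, and the status is returned unmodified because this reference does not list the possible status strings.

```python
def wait_for_snapshot(snapshot, poll_seconds=10):
    """Block until snapshot generation completes, then return its status.

    Hypothetical helper: poll_seconds is forwarded as the documented
    status_update_frequency parameter.
    """
    snapshot.wait_for_completion(
        show_output=False,  # suppress progress output
        status_update_frequency=poll_seconds,
    )
    # get_status returns a str; its possible values are not documented here.
    return snapshot.get_status()
```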

Attributes

dataset_id

Get the Dataset identifier.

Returns

The Dataset ID.

Return type

str

name

Get the Dataset snapshot name.

Returns

The Dataset snapshot name.

Return type

str

workspace

Get the Azure Machine Learning workspace where the Dataset is registered.

Returns

The workspace where the Dataset is registered.

Return type

Workspace