DatasetSnapshot Class

Manages Dataset snapshots, with operations to get a snapshot, return its status, and convert it to a dataframe.

Note

This class is deprecated. For more information, see https://aka.ms/dataset-deprecation.

A DatasetSnapshot object is returned from the create_snapshot method of the Dataset class.

A Dataset snapshot combines a Profile with an optional materialized copy of the data.

To learn more about Dataset Snapshots, go to https://aka.ms/azureml/howto/createsnapshots

Inheritance: builtins.object → DatasetSnapshot

Constructor

DatasetSnapshot(workspace, snapshot_name, dataset_id, definition_version=None, time_stamp=None, profile_action_id=None, datastore_name=None, relative_path=None, dataset_name=None)

Parameters

workspace: Workspace

Required

The workspace the Dataset is registered in.

snapshot_name: str

Required

The name of the Dataset snapshot.

dataset_id: str

Required

The identifier of the Dataset.

definition_version: str

default value: None

The definition version of the Dataset.

time_stamp: datetime

default value: None

The snapshot creation time.

profile_action_id: str

default value: None

The snapshot profile action ID.

datastore_name: str

default value: None

The snapshot data store name.

relative_path: str

default value: None

The relative path to the snapshot data.

dataset_name: str

default value: None

The name of the Dataset.

Methods

compare_profiles	Compare the profile of this Dataset snapshot with the profile of rhs_dataset_snapshot. If either profile does not exist, this method raises an exception.
get	Get the snapshot of Dataset by snapshot name.
get_all	Get all the snapshots of the given Dataset.
get_profile	Get the profile of the Dataset snapshot.
get_status	Get the Dataset snapshot creation status.
is_data_snapshot_available	Check if the materialized copy of the snapshot is available.
to_pandas_dataframe	Create a Pandas DataFrame by loading the data saved with the snapshot.
to_spark_dataframe	Create a Spark DataFrame by loading the data saved with the snapshot.
wait_for_completion	Wait for the completion of DatasetSnapshot generation.

compare_profiles

Compare the profile of this Dataset snapshot with the profile of rhs_dataset_snapshot.

If either profile does not exist, this method raises an exception.

compare_profiles(rhs_dataset_snapshot, include_columns=None, exclude_columns=None, histogram_compare_method=HistogramCompareMethod.WASSERSTEIN)

Parameters

rhs_dataset_snapshot: DatasetSnapshot

Required

The Dataset snapshot to compare with.

include_columns: list[str]

default value: None

A list of column names to be included in the comparison.

exclude_columns: list[str]

default value: None

A list of column names to be excluded from the comparison.

histogram_compare_method: HistogramCompareMethod

default value: HistogramCompareMethod.WASSERSTEIN

An enum describing the comparison method, for example: WASSERSTEIN or ENERGY.

Returns

The difference between the profiles.

Return type

azureml.dataprep.api.engineapi.typedefinitions.DataProfileDifference
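
As a minimal sketch of how this method might be called, the helper below restricts the comparison to a chosen set of columns and leaves histogram_compare_method at its default. The helper name compare_selected_columns is hypothetical and not part of the SDK; both arguments are assumed to be DatasetSnapshot objects whose profiles already exist.

```python
def compare_selected_columns(current_snapshot, rhs_snapshot, columns):
    """Compare two snapshot profiles on a subset of columns.

    Hypothetical helper, not part of the SDK. Both snapshots are assumed
    to be DatasetSnapshot objects with existing profiles; if a profile is
    missing, compare_profiles raises an exception.
    """
    # Copy into a list so that any iterable of column names is accepted.
    include = list(columns)
    # histogram_compare_method is left at its documented default.
    return current_snapshot.compare_profiles(
        rhs_snapshot,
        include_columns=include,
    )
```

The return value is passed through unchanged, so the caller receives whatever difference object compare_profiles produces.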

get

Get the snapshot of Dataset by snapshot name.

static get(workspace, snapshot_name, dataset_name=None, dataset_id=None)

Parameters

workspace: Workspace

Required

The workspace the Dataset is registered in.

snapshot_name: str

Required

The name of the Dataset snapshot.

dataset_name: str

default value: None

The name of the Dataset.

dataset_id: uuid

default value: None

The identifier of the Dataset.

Returns

A DatasetSnapshot object.

Return type

DatasetSnapshot

get_all

Get all the snapshots of the given Dataset.

static get_all(workspace, dataset_name)

Parameters

workspace: Workspace

Required

The workspace the Dataset is registered in.

dataset_name

The Pandas DataFrame is fully materialized in memory. If the snapshot was created with create_data_snapshot=False, then an exception is thrown. To check if the snapshot contains data, use the function is_data_snapshot_available.
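
The check described above can be sketched as a small guard around to_pandas_dataframe. The helper name load_snapshot_as_pandas and the choice of RuntimeError are assumptions made for illustration; the only SDK calls used are is_data_snapshot_available and to_pandas_dataframe.

```python
def load_snapshot_as_pandas(snapshot):
    """Load snapshot data into pandas, failing early if no data was saved.

    Hypothetical helper, not part of the SDK. `snapshot` is assumed to be
    a DatasetSnapshot object.
    """
    # A snapshot created with create_data_snapshot=False holds only a
    # profile, so there is nothing to load.
    if not snapshot.is_data_snapshot_available():
        raise RuntimeError(
            "Snapshot has no materialized data; it was probably created "
            "with create_data_snapshot=False."
        )
    # The whole DataFrame is materialized in memory at this point.
    return snapshot.to_pandas_dataframe()
```

Checking first gives a clearer error message than relying on the exception thrown by to_pandas_dataframe itself.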

to_spark_dataframe

Create a Spark DataFrame by loading the data saved with the snapshot.

to_spark_dataframe()

Returns

A Spark DataFrame.

Return type

DataFrame

Remarks

The returned Spark DataFrame is only an execution plan and does not contain any data, because Spark DataFrames are lazily evaluated. If the snapshot was created with create_data_snapshot=False, an exception is thrown when you try to access the data. To check if the snapshot contains data, use is_data_snapshot_available.

wait_for_completion

Wait for the completion of DatasetSnapshot generation.

wait_for_completion(show_output=True, status_update_frequency=10)

Parameters

show_output: bool

default value: True

Indicates if the method will print the output.

status_update_frequency: int

default value: 10

The Action run status update frequency in seconds.
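
A minimal sketch of a quiet wait followed by a status check, assuming `snapshot` is a DatasetSnapshot object. The helper name wait_and_get_status and the 30-second default are illustrative assumptions; only wait_for_completion and get_status come from this class.

```python
def wait_and_get_status(snapshot, poll_seconds=30):
    """Block until snapshot generation finishes, then return its status.

    Hypothetical helper, not part of the SDK. `snapshot` is assumed to be
    a DatasetSnapshot object.
    """
    # Suppress progress output and poll less often than the default of
    # 10 seconds.
    snapshot.wait_for_completion(
        show_output=False,
        status_update_frequency=poll_seconds,
    )
    # Return whatever get_status reports once the wait is over.
    return snapshot.get_status()
```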
