DatasetSnapshot class

Definition

A class for managing DatasetSnapshot operations.

DatasetSnapshot(workspace, snapshot_name, dataset_id, definition_version=None, time_stamp=None, profile_action_id=None, datastore_name=None, relative_path=None, dataset_name=None)

Inheritance
builtins.object
DatasetSnapshot

Methods

compare_profiles(rhs_dataset_snapshot, include_columns=None, exclude_columns=None, histogram_compare_method=<HistogramCompareMethod.WASSERSTEIN: 0>)

Compare the profile of this snapshot with the profile of rhs_dataset_snapshot.

If the profiles do not exist, this method raises an exception.

get(workspace, snapshot_name, dataset_name=None, dataset_id=None)

Get the snapshot of Dataset by snapshot name.

get_all(workspace, dataset_name)

Get all the snapshots of the given Dataset.

get_profile()

Get the profile of the Dataset snapshot.

get_status()

Get the Dataset snapshot creation status.

is_data_snapshot_available()

Check if the materialized copy of the snapshot is available.

to_pandas_dataframe()

Create a Pandas DataFrame by loading the data saved with the snapshot.

to_spark_dataframe()

Create a Spark DataFrame by loading the data saved with the snapshot.

wait_for_completion(show_output=True, status_update_frequency=10)

Wait for the completion of DatasetSnapshot generation.

compare_profiles(rhs_dataset_snapshot, include_columns=None, exclude_columns=None, histogram_compare_method=<HistogramCompareMethod.WASSERSTEIN: 0>)

Compare the profile of this snapshot with the profile of rhs_dataset_snapshot.

If the profiles do not exist, this method raises an exception.

compare_profiles(rhs_dataset_snapshot, include_columns=None, exclude_columns=None, histogram_compare_method=<HistogramCompareMethod.WASSERSTEIN: 0>)

Parameters

rhs_dataset_snapshot
DatasetSnapshot

Dataset snapshot to compare with.

include_columns
List[str]

List of column names to include in the comparison.

default value: None

exclude_columns
List[str]

List of column names to exclude from the comparison.

default value: None

histogram_compare_method
HistogramCompareMethod

Enum specifying the histogram comparison method, for example WASSERSTEIN or ENERGY.

default value: HistogramCompareMethod.WASSERSTEIN

Returns

Difference of the profiles.

Return type

azureml.dataprep.api.typedefinitions.DataProfileDifference
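
As an illustration of the parameters above, the following is a minimal sketch of a wrapper that compares two snapshots over a chosen set of columns. The helper name compare_snapshot_profiles and its columns argument are hypothetical and not part of the SDK. It assumes both arguments are DatasetSnapshot objects whose profiles already exist, and it passes columns straight through as include_columns.

```python
def compare_snapshot_profiles(lhs_snapshot, rhs_snapshot, columns=None):
    """Return the profile difference between two snapshots.

    Hypothetical helper: `columns` is forwarded as `include_columns`.
    None is the documented default for that parameter.
    """
    # compare_profiles is documented to raise an exception when the
    # profiles do not exist; this sketch lets that exception propagate.
    return lhs_snapshot.compare_profiles(
        rhs_snapshot,
        include_columns=columns,
    )
```

The histogram comparison method is left at its documented default (HistogramCompareMethod.WASSERSTEIN) because the helper does not pass histogram_compare_method.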

get(workspace, snapshot_name, dataset_name=None, dataset_id=None)

Get the snapshot of Dataset by snapshot name.

get(workspace, snapshot_name, dataset_name=None, dataset_id=None)

Parameters

workspace
Workspace

The workspace.

snapshot_name
str

The name of the Dataset snapshot.

dataset_name
str

The name of the Dataset.

default value: None

dataset_id
uuid

Identifier of the Dataset.

default value: None

Returns

Dataset snapshot

Return type

azureml.data.DatasetSnapshot

get_all(workspace, dataset_name)

Get all the snapshots of the given Dataset.

get_all(workspace, dataset_name)

Parameters

workspace
Workspace

The workspace.

dataset_name
str

The name of the Dataset.

Returns

List of Dataset snapshots

Return type

List[azureml.data.DatasetSnapshot]
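
A minimal sketch of listing the snapshot names for one Dataset, combining get_all with the name attribute. The helper snapshot_names is hypothetical. It takes the snapshot class as a parameter (in practice this would be azureml.data.DatasetSnapshot) and assumes get_all can be called on the class itself, as the signature above suggests.

```python
def snapshot_names(snapshot_cls, workspace, dataset_name):
    """Return the names of all snapshots of the given Dataset.

    Hypothetical helper: `snapshot_cls` is expected to expose
    get_all(workspace, dataset_name) as documented above.
    """
    snapshots = snapshot_cls.get_all(workspace, dataset_name)
    # Each returned snapshot exposes its name through the `name` attribute.
    return [snapshot.name for snapshot in snapshots]
```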

get_profile()

Get the profile of the Dataset snapshot.

get_profile()

Returns

DataProfile of the Dataset snapshot

Return type

get_status()

Get the Dataset snapshot creation status.

get_status()

Returns

Status of Dataset snapshot

Return type

str

is_data_snapshot_available()

Check if the materialized copy of the snapshot is available.

is_data_snapshot_available()

Returns

True if data snapshot is available.

Return type

bool

to_pandas_dataframe()

Create a Pandas DataFrame by loading the data saved with the snapshot.

to_pandas_dataframe()

Returns

A Pandas DataFrame.

Return type

pandas.core.frame.DataFrame

Remarks

The Pandas DataFrame is fully materialized in memory. If the snapshot was created with create_data_snapshot=False, this method throws an exception. To check whether the snapshot contains data, use is_data_snapshot_available().
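
The check described in these remarks can be sketched as a small guard around the load. The helper load_snapshot_as_pandas is hypothetical, and returning None when no data is available is a design choice of this sketch, not SDK behavior.

```python
def load_snapshot_as_pandas(snapshot):
    """Load the snapshot data into a Pandas DataFrame if it was materialized.

    Hypothetical helper: returns None instead of letting
    to_pandas_dataframe() raise when the snapshot holds no data.
    """
    if not snapshot.is_data_snapshot_available():
        # The snapshot has no materialized copy of the data to load.
        return None
    return snapshot.to_pandas_dataframe()
```

Callers that prefer a hard failure can skip the guard and call to_pandas_dataframe() directly.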

to_spark_dataframe()

Create a Spark DataFrame by loading the data saved with the snapshot.

to_spark_dataframe()

Returns

A Spark DataFrame.

Return type

Remarks

The returned Spark DataFrame is only an execution plan and does not contain any data, because Spark DataFrames are lazily evaluated. If the snapshot was created with create_data_snapshot=False, an exception is thrown when you try to access the data. To check whether the snapshot contains data, use is_data_snapshot_available().

wait_for_completion(show_output=True, status_update_frequency=10)

Wait for the completion of DatasetSnapshot generation.

wait_for_completion(show_output=True, status_update_frequency=10)

Parameters

show_output
bool

Optional. If True, the method prints the output.

default value: True

status_update_frequency
int

Optional. How often the action run status is updated, in seconds.

default value: 10
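
A minimal sketch of blocking until snapshot generation finishes and then reading the final status. The helper wait_and_get_status and its poll_seconds argument are hypothetical. The sketch assumes snapshot is a DatasetSnapshot and uses only the two documented methods.

```python
def wait_and_get_status(snapshot, poll_seconds=10):
    """Wait for snapshot generation to finish, then return its status string.

    Hypothetical helper: `poll_seconds` is forwarded as
    `status_update_frequency` (documented default: 10).
    """
    # Suppress progress output and only report the final status.
    snapshot.wait_for_completion(
        show_output=False,
        status_update_frequency=poll_seconds,
    )
    return snapshot.get_status()
```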

Attributes

dataset_id

Dataset identifier.

Returns

Dataset id.

Return type

str

name

Return the Dataset snapshot name.

Returns

Dataset snapshot name.

Return type

str

workspace

AML workspace where the Dataset is registered.

Returns

The workspace.

Return type