Dataset class

Definition

Represents a resource for exploring, transforming, and managing data in Azure Machine Learning.

A Dataset is a reference to data in a Datastore. The following Dataset types are supported:

  • TabularDataset represents data in a tabular format created by parsing the provided file or list of files.

  • FileDataset references single or multiple files in datastores or from public URLs.

You can explore data in a Dataset with summary statistics and transform it using intelligent transforms. When you are ready to use the data for training, you can save the Dataset to your Azure Machine Learning workspace as a versioned Dataset.

To get started with datasets, see the article Add & register datasets, or see the notebooks https://aka.ms/tabulardataset-samplenotebook and https://aka.ms/filedataset-samplenotebook.

Dataset(definition, workspace=None, name=None, id=None)
Inheritance
builtins.object
Dataset

Remarks

The Dataset class exposes two convenience class attributes (File and Tabular) you can use for creating a Dataset without working with the corresponding factory methods. For example, to create a dataset using these attributes:

  • Dataset.Tabular.from_delimited_files()

  • Dataset.File.from_files()

You can also create a new TabularDataset or FileDataset by working directly with the methods of the corresponding factory classes, TabularDatasetFactory and FileDatasetFactory.

The following example shows how to create a TabularDataset pointing to a single path in a datastore.


   from azureml.core import Dataset

   # `datastore` is assumed to be an existing Datastore object in your workspace
   dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'train-dataset/tabular/iris.csv')])

   # preview the first 3 rows of the dataset
   dataset.take(3).to_pandas_dataframe()

Full sample is available from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datasets-tutorial/train-with-datasets.ipynb
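
A FileDataset can be created in much the same way through the File attribute. The following is a minimal sketch, assuming azureml-core (SDK v1) is installed; the helper name create_file_dataset is hypothetical and is used here only for illustration.

```python
def create_file_dataset(urls):
    """Create an unregistered FileDataset that references the given public URLs."""
    # Imported inside the function so that defining this helper
    # does not itself require azureml-core to be installed.
    from azureml.core import Dataset

    # from_files accepts a single path or a list of paths/URLs.
    return Dataset.File.from_files(path=urls)
```

Calling create_file_dataset() with a list of your own file URLs returns a FileDataset that is not registered with the workspace.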

Variables

azureml.core.Dataset.File
class
A class attribute that provides access to the FileDatasetFactory methods for creating new FileDataset objects. Usage: Dataset.File.from_files().
azureml.core.Dataset.Tabular
class
A class attribute that provides access to the TabularDatasetFactory methods for creating new TabularDataset objects. Usage: Dataset.Tabular.from_delimited_files().

Methods

archive()

Archive an active or deprecated dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

auto_read_files(path, include_path=False, partition_format=None)

Analyzes the file(s) at the specified path and returns a new Dataset.

Note

This method is deprecated. Use the Dataset.Tabular.from_* methods to read files.

For more information, see https://aka.ms/dataset-deprecation.

compare_profiles(rhs_dataset, profile_arguments={}, include_columns=None, exclude_columns=None, histogram_compare_method=<HistogramCompareMethod.WASSERSTEIN: 0>)

Compare the current Dataset's profile with another dataset profile.

This shows the differences in summary statistics between two datasets. The parameter 'rhs_dataset' stands for "right-hand side", and is simply the second dataset. The first dataset (the current dataset object) is considered the "left-hand side".

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)

Create a snapshot of the registered Dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

delete_snapshot(snapshot_name)

Delete snapshot of the Dataset by name.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

deprecate(deprecate_by_dataset_id)

Deprecate an active dataset in a workspace by another dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

diff(rhs_dataset, compute_target=None, columns=None)

Diff the current Dataset with rhs_dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

from_binary_files(path)

Create an unregistered, in-memory Dataset from binary files.

Note

This method is deprecated. Use Dataset.File.from_files instead.

For more information, see https://aka.ms/dataset-deprecation.

from_delimited_files(path, separator=',', header=<PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS: 3>, encoding=<FileEncoding.UTF8: 0>, quoting=False, infer_column_types=True, skip_rows=0, skip_mode=<SkipLinesBehavior.NO_ROWS: 0>, comment=None, include_path=False, archive_options=None, partition_format=None)

Create an unregistered, in-memory Dataset from delimited files.

Note

This method is deprecated. Use Dataset.Tabular.from_delimited_files instead.

For more information, see https://aka.ms/dataset-deprecation.

from_excel_files(path, sheet_name=None, use_column_headers=False, skip_rows=0, include_path=False, infer_column_types=True, partition_format=None)

Create an unregistered, in-memory Dataset from Excel files.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

from_json_files(path, encoding=<FileEncoding.UTF8: 0>, flatten_nested_arrays=False, include_path=False, partition_format=None)

Create an unregistered, in-memory Dataset from JSON files.

Note

This method is deprecated. Use Dataset.Tabular.from_json_lines_files instead to read from JSON Lines files.

For more information, see https://aka.ms/dataset-deprecation.

from_pandas_dataframe(dataframe, path=None, in_memory=False)

Create an unregistered, in-memory Dataset from a pandas dataframe.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

from_parquet_files(path, include_path=False, partition_format=None)

Create an unregistered, in-memory Dataset from parquet files.

Note

This method is deprecated. Use Dataset.Tabular.from_parquet_files instead.

For more information, see https://aka.ms/dataset-deprecation.

from_sql_query(data_source, query)

Create an unregistered, in-memory Dataset from a SQL query.

Note

This method is deprecated. Use Dataset.Tabular.from_sql_query instead.

For more information, see https://aka.ms/dataset-deprecation.

generate_profile(compute_target=None, workspace=None, arguments=None)

Generate new profile for the Dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

get(workspace, name=None, id=None)

Get a Dataset that already exists in the workspace by specifying either its name or ID.

Note

This method is deprecated. Use get_by_name(workspace, name, version='latest') and get_by_id(workspace, id) instead.

For more information, see https://aka.ms/dataset-deprecation.

get_all(workspace)

Get all the registered datasets in the workspace.

get_all_snapshots()

Get all snapshots of the Dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

get_by_id(workspace, id)

Get a Dataset which is saved to the workspace.

get_by_name(workspace, name, version='latest')

Get a registered Dataset from workspace by its registration name.

get_definition(version_id=None)

Get a specific definition of the Dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

get_definitions()

Get all the definitions of the Dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None)

Get summary statistics on the Dataset computed earlier.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

get_snapshot(snapshot_name)

Get snapshot of the Dataset by name.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

head(count)

Pull the specified number of records from this Dataset and return them as a DataFrame.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

list(workspace)

List all the Datasets in the workspace, including ones with is_visible property equal to False.

Note

This method is deprecated. Use get_all(workspace) instead.

For more information, see https://aka.ms/dataset-deprecation.

reactivate()

Reactivate an archived or deprecated dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

register(workspace, name, description=None, tags=None, visible=True, exist_ok=False, update_if_exist=False)

Register the Dataset in the workspace, making it available to other users of the workspace.

sample(sample_strategy, arguments)

Generate a new sample from the source Dataset, using the sampling strategy and parameters provided.

Note

This method is deprecated. Create a TabularDataset by calling the static methods on Dataset.Tabular and use the take_sample(probability, seed=None) method there.

For more information, see https://aka.ms/dataset-deprecation.

to_pandas_dataframe()

Create a Pandas dataframe by executing the transformation pipeline defined by this Dataset definition.

Note

This method is deprecated. Create a TabularDataset by calling the static methods on Dataset.Tabular and use the to_pandas_dataframe() method there.

For more information, see https://aka.ms/dataset-deprecation.

to_spark_dataframe()

Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataset definition.

Note

This method is deprecated. Create a TabularDataset by calling the static methods on Dataset.Tabular and use the to_spark_dataframe() method there.

For more information, see https://aka.ms/dataset-deprecation.

update(name=None, description=None, tags=None, visible=None)

Update the Dataset mutable attributes in the workspace and return the updated Dataset from the workspace.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

update_definition(definition, definition_update_message)

Update the Dataset definition.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

archive()

Archive an active or deprecated dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

archive()

Returns

None.

Return type

Remarks

After archival, any attempt to consume the Dataset will result in an error. If a Dataset is archived by accident, call reactivate() to activate it again.

auto_read_files(path, include_path=False, partition_format=None)

Analyzes the file(s) at the specified path and returns a new Dataset.

Note

This method is deprecated. Use the Dataset.Tabular.from_* methods to read files.

For more information, see https://aka.ms/dataset-deprecation.

auto_read_files(path, include_path=False, partition_format=None)

Parameters

path
DataReference or str

A data path in a registered datastore, a local path, or an HTTP URL (CSV/TSV).

include_path
bool or optional

Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files and want to know which file a particular record originated from. It is also useful if the file path or name contains information that you want in a column.

partition_format
str or optional

Specifies the partition format of the path. A '{x}' placeholder creates a string column, and a '{x:yyyy/MM/dd/HH/mm/ss}' placeholder creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given a file path '../USA/2019/01/01/data.csv' where the data is partitioned by country and time, the format '/{Country}/{PartitionDate:yyyy/MM/dd}/data.csv' creates the columns 'Country' of string type and 'PartitionDate' of datetime type.
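
To make the partition_format rules more concrete, the following pure-Python sketch mimics how a format string could be matched against the end of a file path. It is only an illustration of the placeholder semantics described above, not the SDK's implementation; the function names parse_partition and _datetime_regex are hypothetical.

```python
import re
from datetime import datetime

# Format tokens named in the parameter description, mapped to strptime directives.
_TOKENS = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H", "mm": "%M", "ss": "%S"}
_TOKEN_RE = re.compile("yyyy|MM|dd|HH|mm|ss")


def _datetime_regex(dt_fmt):
    """Turn a datetime format such as 'yyyy/MM/dd' into a matching regex."""
    parts, pos = [], 0
    for tok in _TOKEN_RE.finditer(dt_fmt):
        parts.append(re.escape(dt_fmt[pos:tok.start()]))
        parts.append(r"\d{4}" if tok.group() == "yyyy" else r"\d{2}")
        pos = tok.end()
    parts.append(re.escape(dt_fmt[pos:]))
    return "".join(parts)


def parse_partition(path, partition_format):
    """Extract partition columns from the tail of `path` (illustrative only)."""
    pattern, columns, pos = [], [], 0
    for m in re.finditer(r"\{(\w+)(?::([^}]+))?\}", partition_format):
        pattern.append(re.escape(partition_format[pos:m.start()]))
        name, dt_fmt = m.group(1), m.group(2)
        # '{x}' matches one path segment; '{x:...}' matches the datetime layout.
        pattern.append("(" + (_datetime_regex(dt_fmt) if dt_fmt else "[^/]+") + ")")
        columns.append((name, dt_fmt))
        pos = m.end()
    # The format must run to the end of the file path.
    pattern.append(re.escape(partition_format[pos:]) + "$")

    match = re.search("".join(pattern), path)
    if match is None:
        raise ValueError("path does not match partition_format")

    row = {}
    for (name, dt_fmt), value in zip(columns, match.groups()):
        if dt_fmt:
            strptime_fmt = _TOKEN_RE.sub(lambda t: _TOKENS[t.group()], dt_fmt)
            row[name] = datetime.strptime(value, strptime_fmt)
        else:
            row[name] = value
    return row
```

For the path and format from the example above, parse_partition returns a 'Country' string and a 'PartitionDate' datetime.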

Returns

Dataset object.

Return type

Remarks

Use this method when you want file formats and delimiters to be detected automatically.

After creating a Dataset, you should use get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None) to list detected column types and summary statistics for each column.

The returned Dataset is not registered with the workspace.

compare_profiles(rhs_dataset, profile_arguments={}, include_columns=None, exclude_columns=None, histogram_compare_method=<HistogramCompareMethod.WASSERSTEIN: 0>)

Compare the current Dataset's profile with another dataset profile.

This shows the differences in summary statistics between two datasets. The parameter 'rhs_dataset' stands for "right-hand side", and is simply the second dataset. The first dataset (the current dataset object) is considered the "left-hand side".

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

compare_profiles(rhs_dataset, profile_arguments={}, include_columns=None, exclude_columns=None, histogram_compare_method=<HistogramCompareMethod.WASSERSTEIN: 0>)

Parameters

rhs_dataset
Dataset

A second Dataset, also called a "right-hand side" Dataset, for comparison.

profile_arguments
dict or optional

Arguments to retrieve a specific profile.

include_columns
list[str] or optional

List of column names to be included in comparison.

exclude_columns
list[str] or optional

List of column names to be excluded in comparison.

histogram_compare_method
HistogramCompareMethod or optional

Enum describing the comparison method, for example, Wasserstein or Energy.

Returns

Difference between the two dataset profiles.

Return type

azureml.dataprep.api.typedefinitions.DataProfileDifference

Remarks

This is for registered Datasets only. Raises an exception if the current Dataset's profile does not exist. For unregistered Datasets, use the profile.compare method.

create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)

Create a snapshot of the registered Dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)

Parameters

snapshot_name
str

The snapshot name. Snapshot names should be unique within a Dataset.

compute_target
ComputeTarget or str or optional

Optional compute target to perform the snapshot profile creation. If omitted, the local compute is used.

create_data_snapshot
bool or optional

If True, a materialized copy of the data will be created.

target_datastore
AbstractAzureStorageDatastore or str or optional

Target datastore to save snapshot. If omitted, the snapshot will be created in the default storage of the workspace.

Returns

Dataset snapshot object.

Return type

Remarks

Snapshots capture point-in-time summary statistics of the underlying data and, optionally, a copy of the data itself. To learn more about creating snapshots, go to https://aka.ms/azureml/howto/createsnapshots.

delete_snapshot(snapshot_name)

Delete snapshot of the Dataset by name.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

delete_snapshot(snapshot_name)

Parameters

snapshot_name
str

The snapshot name.

Returns

None.

Return type

Remarks

Use this to free up storage consumed by data saved in snapshots that you no longer need.

deprecate(deprecate_by_dataset_id)

Deprecate an active dataset in a workspace by another dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

deprecate(deprecate_by_dataset_id)

Parameters

deprecate_by_dataset_id
str

The Dataset ID which is the intended replacement for this Dataset.

Returns

None.

Return type

Remarks

Deprecated Datasets will log warnings when they are consumed. Deprecating a dataset deprecates all its definitions.

Deprecated Datasets can still be consumed. To completely block a Dataset from being consumed, archive it.

If a Dataset is deprecated by accident, call reactivate() to activate it again.
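
The lifecycle rules described for archive, deprecate, and reactivate can be summarized with a small toy model. This is a sketch for illustration only and is not part of the SDK: the class names, the use of RuntimeError, and the use of Python warnings are all assumptions made for the example.

```python
import warnings
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    ARCHIVED = "archived"


class DatasetLifecycle:
    """Toy model of the documented lifecycle rules (hypothetical, not SDK code)."""

    def __init__(self):
        self.state = State.ACTIVE
        self.replacement_id = None

    def deprecate(self, deprecate_by_dataset_id):
        # A deprecated dataset records the ID of its intended replacement.
        self.state = State.DEPRECATED
        self.replacement_id = deprecate_by_dataset_id

    def archive(self):
        self.state = State.ARCHIVED

    def reactivate(self):
        # Works for both archived and deprecated datasets.
        self.state = State.ACTIVE
        self.replacement_id = None

    def consume(self):
        if self.state is State.ARCHIVED:
            # Archived datasets are completely blocked from consumption.
            raise RuntimeError("Archived datasets cannot be consumed.")
        if self.state is State.DEPRECATED:
            # Deprecated datasets still work but log a warning.
            warnings.warn(f"Dataset is deprecated; use {self.replacement_id} instead.")
        return "data"
```

In this model, a deprecated dataset can still be consumed with a warning, an archived dataset raises an error, and reactivate() returns either one to the active state.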

diff(rhs_dataset, compute_target=None, columns=None)

Diff the current Dataset with rhs_dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

diff(rhs_dataset, compute_target=None, columns=None)

Parameters

rhs_dataset
Dataset

Another Dataset, also called the "right-hand side" Dataset, for comparison.

compute_target
ComputeTarget or str or optional

The compute target to perform the diff. If omitted, the local compute is used.

columns
list[str] or optional

List of column names to be included in diff.

Returns

Dataset action run object.

Return type

from_binary_files(path)

Create an unregistered, in-memory Dataset from binary files.

Note

This method is deprecated. Use Dataset.File.from_files instead.

For more information, see https://aka.ms/dataset-deprecation.

from_binary_files(path)

Parameters

path
DataReference or str

A data path in a registered datastore or a local path.

Returns

The Dataset object.

Return type

Remarks

Use this method to read files as streams of binary data. Returns one file stream object per file read. Use this method when you're reading images, videos, audio or other binary data.

get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None) and create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None) will not work as expected for a Dataset created by this method.

The returned Dataset is not registered with the workspace.

from_delimited_files(path, separator=',', header=<PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS: 3>, encoding=<FileEncoding.UTF8: 0>, quoting=False, infer_column_types=True, skip_rows=0, skip_mode=<SkipLinesBehavior.NO_ROWS: 0>, comment=None, include_path=False, archive_options=None, partition_format=None)

Create an unregistered, in-memory Dataset from delimited files.

Note

This method is deprecated. Use Dataset.Tabular.from_delimited_files instead.

For more information, see https://aka.ms/dataset-deprecation.

from_delimited_files(path, separator=',', header=<PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS: 3>, encoding=<FileEncoding.UTF8: 0>, quoting=False, infer_column_types=True, skip_rows=0, skip_mode=<SkipLinesBehavior.NO_ROWS: 0>, comment=None, include_path=False, archive_options=None, partition_format=None)

Parameters

path
DataReference or str

A data path in a registered datastore, a local path, or an HTTP URL.

separator
str

The separator used to split columns.

header

Controls how column headers are promoted when reading from files.

encoding

The encoding of the files being read.

quoting
bool or optional

Specify how to handle new line characters within quotes. The default (False) is to interpret new line characters as starting new rows, irrespective of whether the new line characters are within quotes or not. If set to True, new line characters inside quotes will not result in new rows, and file reading speed will slow down.

infer_column_types
bool or optional

Indicates whether column data types are inferred.

skip_rows
int or optional

How many rows to skip in the file(s) being read.

skip_mode
SkipLinesBehavior or optional

Controls how rows are skipped when reading from files.

comment
str or optional

Character used to indicate comment lines in the files being read. Lines beginning with this string will be skipped.

include_path
bool or optional

Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files, and want to know which file a particular record originated from, or to keep useful information in file path.

archive_options
azureml.dataprep.api._archiveoption.ArchiveOptions

Options for archive file, including archive type and entry glob pattern. We only support ZIP as archive type at the moment. For example, specifying


   archive_options = ArchiveOptions(archive_type = ArchiveType.ZIP, entry_glob = '*10-20.csv')

reads all files with name ending with "10-20.csv" in ZIP.

partition_format
str or optional

Specifies the partition format of the path. A '{x}' placeholder creates a string column, and a '{x:yyyy/MM/dd/HH/mm/ss}' placeholder creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given a file path '../USA/2019/01/01/data.csv' where the data is partitioned by country and time, the format '/{Country}/{PartitionDate:yyyy/MM/dd}/data.csv' creates the columns 'Country' of string type and 'PartitionDate' of datetime type.

Returns

Dataset object.

Return type

Remarks

Use this method to read delimited text files when you want to control the options used.

After creating a Dataset, you should use get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None) to list detected column types and summary statistics for each column.

The returned Dataset is not registered with the workspace.
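
The effect of the quoting parameter can be illustrated without the SDK. In the sketch below, Python's standard csv module stands in for the quote-aware reader (quoting=True), and a plain line split stands in for the default behavior (quoting=False); this is an analogy for the documented behavior, not the reader the SDK actually uses.

```python
import csv
import io

# A delimited file whose second record contains a newline inside quotes.
raw = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Analogue of quoting=False: every newline starts a new row,
# so the quoted field is split across two rows.
naive_rows = [line.split(",") for line in raw.splitlines()]

# Analogue of quoting=True: newlines inside quotes stay within the field.
quoted_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_rows), len(quoted_rows))
```

The naive split produces one more row than the quote-aware reader because the quoted comment is broken in two.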

from_excel_files(path, sheet_name=None, use_column_headers=False, skip_rows=0, include_path=False, infer_column_types=True, partition_format=None)

Create an unregistered, in-memory Dataset from Excel files.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

from_excel_files(path, sheet_name=None, use_column_headers=False, skip_rows=0, include_path=False, infer_column_types=True, partition_format=None)

Parameters

path
DataReference or str

A data path in a registered datastore or a local path.

sheet_name
str or optional

The name of the Excel sheet to load. By default we read the first sheet from each Excel file.

use_column_headers
bool or optional

Controls whether to use the first row as column headers.

skip_rows
int or optional

How many rows to skip in the file(s) being read.

include_path
bool or optional

Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files, and want to know which file a particular record originated from, or to keep useful information in file path.

infer_column_types
bool or optional

If true, column data types will be inferred.

partition_format
str or optional

Specifies the partition format of the path. A '{x}' placeholder creates a string column, and a '{x:yyyy/MM/dd/HH/mm/ss}' placeholder creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given a file path '../USA/2019/01/01/data.csv' where the data is partitioned by country and time, the format '/{Country}/{PartitionDate:yyyy/MM/dd}/data.csv' creates the columns 'Country' of string type and 'PartitionDate' of datetime type.

Returns

Dataset object.

Return type

Remarks

Use this method to read Excel files in .xlsx format. Data can be read from one sheet in each Excel file. After creating a Dataset, you should use get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None) to list detected column types and summary statistics for each column. The returned Dataset is not registered with the workspace.

from_json_files(path, encoding=<FileEncoding.UTF8: 0>, flatten_nested_arrays=False, include_path=False, partition_format=None)

Create an unregistered, in-memory Dataset from JSON files.

Note

This method is deprecated. Use Dataset.Tabular.from_json_lines_files instead to read from JSON Lines files.

For more information, see https://aka.ms/dataset-deprecation.

from_json_files(path, encoding=<FileEncoding.UTF8: 0>, flatten_nested_arrays=False, include_path=False, partition_format=None)

Parameters

path
DataReference or str

The path to the file(s) or folder(s) that you want to load and parse. It can be either a local path or an Azure Blob URL. Globbing is supported. For example, you can use path = "./data*" to read all files with names starting with "data".

encoding
FileEncoding

The encoding of the files being read.

flatten_nested_arrays
bool or optional

Controls how nested arrays are handled. If you choose to flatten nested JSON arrays, it could result in a much larger number of rows.

include_path
bool or optional

Whether to include a column containing the path from which the data was read. This is useful when you are reading multiple files, and might want to know which file a particular record originated from, or to keep useful information in file path.

partition_format
str or optional

Specifies the partition format of the path. A '{x}' placeholder creates a string column, and a '{x:yyyy/MM/dd/HH/mm/ss}' placeholder creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given a file path '../USA/2019/01/01/data.csv' where the data is partitioned by country and time, the format '/{Country}/{PartitionDate:yyyy/MM/dd}/data.csv' creates the columns 'Country' of string type and 'PartitionDate' of datetime type.

Returns

The local Dataset object.

Return type

from_pandas_dataframe(dataframe, path=None, in_memory=False)

Create an unregistered, in-memory Dataset from a pandas dataframe.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

from_pandas_dataframe(dataframe, path=None, in_memory=False)

Parameters

dataframe
DataFrame

The Pandas DataFrame.

path
DataReference or str

A data path in a registered datastore or a local folder path.

in_memory
bool or optional

Whether to read the DataFrame from memory instead of persisting to disk.

Returns

A Dataset object.

Return type

Remarks

Use this method to convert a Pandas DataFrame to a Dataset object. A Dataset created by this method cannot be registered, as the data is from memory.

If in_memory is False, the Pandas DataFrame is converted to a CSV file locally. If path is of type DataReference, the Pandas DataFrame is uploaded to the datastore, and the Dataset is based on the DataReference. If path is a local folder, the Dataset is created from the local file, which cannot be deleted.

Raises an exception if the current DataReference is not a folder path.

from_parquet_files(path, include_path=False, partition_format=None)

Create an unregistered, in-memory Dataset from parquet files.

Note

This method is deprecated. Use Dataset.Tabular.from_parquet_files instead.

For more information, see https://aka.ms/dataset-deprecation.

from_parquet_files(path, include_path=False, partition_format=None)

Parameters

path
DataReference or str

A data path in a registered datastore or a local path.

include_path
bool or optional

Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files, and want to know which file a particular record originated from, or to keep useful information in file path.

partition_format
str or optional

Specifies the partition format of the path. A '{x}' placeholder creates a string column, and a '{x:yyyy/MM/dd/HH/mm/ss}' placeholder creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given a file path '../USA/2019/01/01/data.csv' where the data is partitioned by country and time, the format '/{Country}/{PartitionDate:yyyy/MM/dd}/data.csv' creates the columns 'Country' of string type and 'PartitionDate' of datetime type.

Returns

Dataset object.

Return type

Remarks

Use this method to read Parquet files.

After creating a Dataset, you should use get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None) to list detected column types and summary statistics for each column.

The returned Dataset is not registered with the workspace.

from_sql_query(data_source, query)

Create an unregistered, in-memory Dataset from a SQL query.

Note

This method is deprecated. Use Dataset.Tabular.from_sql_query instead.

For more information, see https://aka.ms/dataset-deprecation.

from_sql_query(data_source, query)

Parameters

data_source
AzureSqlDatabaseDatastore

The details of the Azure SQL datastore.

query
str

The query to execute to read data.

Returns

The local Dataset object.

Return type

generate_profile(compute_target=None, workspace=None, arguments=None)

Generate new profile for the Dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

generate_profile(compute_target=None, workspace=None, arguments=None)

Parameters

compute_target
ComputeTarget or str or optional

An optional compute target to perform the snapshot profile creation. If omitted, the local compute is used.

workspace
Workspace or optional

Workspace, required for transient (unregistered) Datasets.

arguments
dict[str or object] or optional

Profile arguments. Valid arguments are:

  • 'include_stype_counts' of type bool. Checks whether values look like well-known semantic types such as email address, IP address (V4/V6), US phone number, US ZIP code, or latitude/longitude. Enabling this impacts performance.

  • 'number_of_histogram_bins' of type int. Represents the number of histogram bins to use for numeric data. The default value is 10.

Returns

Dataset action run object.

Return type

Remarks

This is a synchronous call that blocks until it completes. Call get_result() to get the result of the action.

get(workspace, name=None, id=None)

Get a Dataset that already exists in the workspace by specifying either its name or ID.

Note

This method is deprecated. Use get_by_name(workspace, name, version='latest') and get_by_id(workspace, id) instead.

For more information, see https://aka.ms/dataset-deprecation.

get(workspace, name=None, id=None)

Parameters

workspace
Workspace

The existing AzureML workspace in which the Dataset was created.

name
str or optional

The name of the Dataset to be retrieved.

id
str or optional

A unique identifier of the Dataset in the workspace.

Returns

The Dataset with the specified name or ID.

Return type

Remarks

You can provide either name or id. An exception is raised if:

  • both name and id are specified but don't match.

  • the Dataset with the specified name or id cannot be found in the workspace.

get_all(workspace)

Get all the registered datasets in the workspace.

get_all(workspace)

Parameters

workspace
Workspace

The existing AzureML workspace in which the Datasets were registered.

Returns

A dictionary of TabularDataset and FileDataset objects keyed by their registration name.

Return type

dict[str, azureml.data.TabularDataset | azureml.data.FileDataset]

get_all_snapshots()

Get all snapshots of the Dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

get_all_snapshots()

Returns

List of Dataset snapshots.

Return type

get_by_id(workspace, id)

Get a Dataset which is saved to the workspace.

get_by_id(workspace, id)

Parameters

workspace
Workspace

The existing AzureML workspace in which the Dataset is saved.

id
str

The ID of the dataset.

Returns

The dataset object. If the dataset is registered, its registration name and version will also be returned.

Return type

azureml.data.TabularDataset | azureml.data.FileDataset

get_by_name(workspace, name, version='latest')

Get a registered Dataset from workspace by its registration name.

get_by_name(workspace, name, version='latest')

Parameters

workspace
Workspace

The existing AzureML workspace in which the Dataset was registered.

name
str

The registration name.

version
int

The registration version. Defaults to 'latest'.

Returns

The registered dataset object.

Return type

azureml.data.TabularDataset | azureml.data.FileDataset

get_definition(version_id=None)

Get a specific definition of the Dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

get_definition(version_id=None)

Parameters

version_id
str or optional

The version ID of the Dataset definition.

Returns

The Dataset definition.

Return type

Remarks

If version_id is provided, then Azure Machine Learning tries to get the definition corresponding to that version. If that version does not exist, an exception is thrown. If version_id is omitted, then the latest version is retrieved.

get_definitions()

Get all the definitions of the Dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

get_definitions()

Returns

A dictionary of Dataset definitions.

Return type

Remarks

A Dataset registered in an AzureML workspace can have multiple definitions, each created by calling update_definition(definition, definition_update_message). Each definition has a unique identifier. The current definition is the latest one created.

For unregistered Datasets, only one definition exists.

get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None)

Get summary statistics on the Dataset computed earlier.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None)

Parameters

arguments
dict[str, object] or optional

Profile arguments.

generate_if_not_exist
bool or optional

Indicates whether to generate a profile if it doesn't exist.

workspace
Workspace or optional

The workspace. Required for transient (unregistered) Datasets.

compute_target
ComputeTarget or str or optional

An optional compute target on which to execute the profile action.

Returns

DataProfile of the Dataset.

Return type

Remarks

For a Dataset registered with an Azure Machine Learning workspace, this method retrieves an existing profile that was created earlier by calling get_profile, provided the profile is still valid. A profile is invalidated when changed data is detected in the Dataset or when the arguments to get_profile differ from the ones used when the profile was generated. If the profile is not present or has been invalidated, generate_if_not_exist determines whether a new profile is generated.

For a Dataset that is not registered with an Azure Machine Learning workspace, this method always runs generate_profile(compute_target=None, workspace=None, arguments=None) and returns the result.

get_snapshot(snapshot_name)

Get a snapshot of the Dataset by name.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

get_snapshot(snapshot_name)

Parameters

snapshot_name
str

The snapshot name.

Returns

Dataset snapshot object.

Return type

head(count)

Pull the specified number of records from this Dataset and return them as a DataFrame.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

head(count)

Parameters

count
int

The number of records to pull.

Returns

A Pandas DataFrame.

Return type

list(workspace)

List all the Datasets in the workspace, including those whose is_visible property is False.

Note

This method is deprecated. Use get_all(workspace) instead.

For more information, see https://aka.ms/dataset-deprecation.

list(workspace)

Parameters

workspace
Workspace

The workspace for which you want to retrieve the list of Datasets.

Returns

A list of Dataset objects.

Return type

reactivate()

Reactivate an archived or deprecated dataset.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

reactivate()

Returns

None.

Return type

register(workspace, name, description=None, tags=None, visible=True, exist_ok=False, update_if_exist=False)

Register the Dataset in the workspace, making it available to other users of the workspace.

register(workspace, name, description=None, tags=None, visible=True, exist_ok=False, update_if_exist=False)

Parameters

workspace
Workspace

The AzureML workspace in which the Dataset is to be registered.

name
str

The name of the Dataset in the workspace.

description
str or optional

A description of the Dataset.

tags
dict[str, str] or optional

Tags to associate with the Dataset.

visible
bool or optional

Indicates whether the Dataset is visible in the UI. If False, the Dataset is hidden in the UI but remains available through the SDK.

exist_ok
bool or optional

If True and the Dataset already exists in the given workspace, the method returns the existing Dataset. If False, an error is raised in that case.

update_if_exist
bool or optional

If exist_ok is True and update_if_exist is True, this method will update the definition and return the updated Dataset.

Returns

A registered Dataset object in the workspace.

Return type
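As an illustration of how exist_ok and update_if_exist combine, the sketch below wraps the call in a hypothetical helper (register_or_update is not part of the SDK). It assumes that dataset is an object whose register method has exactly the signature documented here, and that workspace is an existing Workspace object.

```python
def register_or_update(dataset, workspace, name, description=None, tags=None):
    """Hypothetical helper: register `dataset`, updating the definition if the name exists."""
    return dataset.register(
        workspace=workspace,
        name=name,
        description=description,
        tags=tags,
        visible=True,
        exist_ok=True,         # return the existing Dataset instead of raising
        update_if_exist=True,  # and update its definition
    )
```

With both flags left at their default of False, the same call would raise an error when a Dataset with that name is already registered.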

sample(sample_strategy, arguments)

Generate a new sample from the source Dataset, using the sampling strategy and parameters provided.

Note

This method is deprecated. Create a TabularDataset by calling the static methods on Dataset.Tabular and use the take_sample(probability, seed=None) method there.

For more information, see https://aka.ms/dataset-deprecation.

sample(sample_strategy, arguments)

Parameters

sample_strategy
str

Sample strategy to use. Accepted values are "top_n", "simple_random", or "stratified".

arguments
dict[str, object]

A dictionary whose keys come from the "Optional arguments" lists in the Remarks and whose values match the type given for each argument. Only arguments from the corresponding sampling method can be used. For example, for a "simple_random" sample type, you can only specify a dictionary with "probability" and "seed" keys.

Returns

A Dataset object containing a sample of the original dataset.

Return type

Remarks

Samples are generated by executing the transformation pipeline defined by this Dataset, and then applying the sampling strategy and parameters to the output data. Each sampling method supports the following optional arguments:

  • top_n

    • Optional arguments

      • n, type integer. Select top N rows as your sample.

  • simple_random

    • Optional arguments

      • probability, type float. Simple random sampling where each row has equal probability of being selected. Probability should be a number between 0 and 1.

      • seed, type float. Used by random number generator. Use for repeatability.

  • stratified

    • Optional arguments

      • columns, type list[str]. List of strata columns in the data.

      • seed, type float. Used by random number generator. Use for repeatability.

      • fractions, type dict[tuple, float]. Tuple: column values that define a stratum, must be in the same order as column names. Float: weight attached to a stratum during sampling.

The following code snippets show how to call each sampling method.


   # sample_strategy "top_n"
   top_n_sample_dataset = dataset.sample('top_n', {'n': 5})

   # sample_strategy "simple_random"
   simple_random_sample_dataset = dataset.sample('simple_random', {'probability': 0.3, 'seed': 10.2})

   # sample_strategy "stratified"
   fractions = {}
   fractions[('THEFT',)] = 0.5
   fractions[('DECEPTIVE PRACTICE',)] = 0.2

   # take 50% of records with "Primary Type" as THEFT and 20% of records with "Primary Type" as
   # DECEPTIVE PRACTICE into sample Dataset
   sample_dataset = dataset.sample('stratified', {'columns': ['Primary Type'], 'fractions': fractions})
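To make the role of the fractions dictionary concrete, here is a plain-Python sketch of stratified sampling over a list of row dictionaries. It is an illustration only and not the SDK's implementation: the function stratified_sample is hypothetical, it treats each weight as a per-row inclusion probability, and it assumes that rows whose stratum is not listed in fractions are dropped.

```python
import random


def stratified_sample(rows, columns, fractions, seed=None):
    """Keep each row with the probability assigned to its stratum.

    rows: list of dicts; columns: strata column names;
    fractions: dict mapping a tuple of column values to a probability.
    Assumption: rows whose stratum is not listed in `fractions` are dropped.
    """
    rng = random.Random(seed)
    sample = []
    for row in rows:
        # The tuple follows the order of `columns`, as the fractions keys must.
        stratum = tuple(row[c] for c in columns)
        if rng.random() < fractions.get(stratum, 0.0):
            sample.append(row)
    return sample


rows = (
    [{"Primary Type": "THEFT"}] * 1000
    + [{"Primary Type": "DECEPTIVE PRACTICE"}] * 1000
    + [{"Primary Type": "ASSAULT"}] * 1000
)
fractions = {("THEFT",): 0.5, ("DECEPTIVE PRACTICE",): 0.2}
sample = stratified_sample(rows, ["Primary Type"], fractions, seed=10)
```

Here roughly half of the THEFT rows and a fifth of the DECEPTIVE PRACTICE rows end up in sample, and no ASSAULT rows do.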

to_pandas_dataframe()

Create a Pandas dataframe by executing the transformation pipeline defined by this Dataset definition.

Note

This method is deprecated. Create a TabularDataset by calling the static methods on Dataset.Tabular and use the to_pandas_dataframe() method there. For more information, see https://aka.ms/dataset-deprecation.

to_pandas_dataframe()

Returns

A Pandas DataFrame.

Return type

Remarks

The returned Pandas DataFrame is fully materialized in memory.
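Following the deprecation notes for sample and to_pandas_dataframe, the equivalent workflow on a TabularDataset might look like the sketch below. The helper sample_to_pandas is hypothetical. It assumes tabular_dataset was created through Dataset.Tabular, for example with from_delimited_files, and relies only on the take_sample(probability, seed=None) and to_pandas_dataframe() methods named in those notes.

```python
def sample_to_pandas(tabular_dataset, probability, seed=None):
    """Hypothetical helper: sample a TabularDataset and materialize it in memory."""
    if not 0 <= probability <= 1:
        raise ValueError("probability must be between 0 and 1")
    return tabular_dataset.take_sample(probability, seed=seed).to_pandas_dataframe()


# Typical use, assuming `datastore` is an existing Datastore object:
#   from azureml.core import Dataset
#   dataset = Dataset.Tabular.from_delimited_files(
#       path=[(datastore, 'train-dataset/tabular/iris.csv')])
#   df = sample_to_pandas(dataset, probability=0.3, seed=10)
```

Because the whole sample is loaded into memory, a small probability is the safer choice for large datasets.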

to_spark_dataframe()

Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataset definition.

Note

This method is deprecated. Create a TabularDataset by calling the static methods on Dataset.Tabular and use the to_spark_dataframe() method there.

For more information, see https://aka.ms/dataset-deprecation.

to_spark_dataframe()

Returns

A Spark DataFrame.

Return type

Remarks

The returned Spark DataFrame is only an execution plan and does not contain any data, because Spark DataFrames are lazily evaluated.

update(name=None, description=None, tags=None, visible=None)

Update the mutable attributes of the Dataset in the workspace and return the updated Dataset from the workspace.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

update(name=None, description=None, tags=None, visible=None)

Parameters

name
str or optional

The name of the Dataset in the workspace.

description
str or optional

A description of the data.

tags
dict[str, str] or optional

Tags to associate with the Dataset.

visible
bool or optional

Indicates whether the Dataset is visible in the UI.

Returns

An updated Dataset object from the workspace.

Return type

update_definition(definition, definition_update_message)

Update the Dataset definition.

Note

This method is deprecated. For more information, see https://aka.ms/dataset-deprecation.

update_definition(definition, definition_update_message)

Parameters

definition
DatasetDefinition

The new definition of this Dataset.

definition_update_message
str

The definition update message.

Returns

An updated Dataset object from the workspace.

Return type

Remarks

To consume the updated Dataset, use the object returned by this method.

Attributes

definition

Return the current Dataset definition.

Note

This property is deprecated. For more information, see https://aka.ms/dataset-deprecation.

Returns

The Dataset definition.

Return type

Remarks

A Dataset definition is a series of steps that specify how to read and transform data.

A Dataset registered in an AzureML workspace can have multiple definitions, each created by calling update_definition(definition, definition_update_message). Each definition has a unique identifier. Having multiple definitions allows you to make changes to existing Datasets without breaking models and pipelines that depend on the older definition.

For unregistered Datasets, only one definition exists.

definition_version

Return the version of the current definition of the Dataset.

Note

This property is deprecated. For more information, see https://aka.ms/dataset-deprecation.

Returns

The Dataset definition version.

Return type

str

Remarks

A Dataset definition is a series of steps that specify how to read and transform data.

A Dataset registered in an AzureML workspace can have multiple definitions, each created by calling update_definition(definition, definition_update_message). Each definition has a unique identifier. The current definition is the latest one created, and this property returns its ID.

For unregistered Datasets, only one definition exists.

description

Return the description of the Dataset.

Returns

The Dataset description.

Return type

str

Remarks

Specifying a description of the data in the Dataset enables users of the workspace to understand what the data represents, and how they can use it.

id

If the Dataset was registered in a workspace, return the ID of the Dataset. Otherwise, return None.

Returns

The Dataset ID.

Return type

str

is_visible

Control the visibility of a registered Dataset in the Azure ML workspace UI.

Note

This property is deprecated. For more information, see https://aka.ms/dataset-deprecation.

Returns

The Dataset visibility.

Return type

Remarks

Values returned:

  • True: Dataset is visible in workspace UI. Default.

  • False: Dataset is hidden in workspace UI.

Has no effect on unregistered Datasets.

name

Return the Dataset name.

Returns

The Dataset name.

Return type

str

state

Return the state of the Dataset.

Note

This property is deprecated. For more information, see https://aka.ms/dataset-deprecation.

Returns

The Dataset state.

Return type

str

Remarks

The meaning and effect of states are as follows:

  • Active. All actions can be performed on an active definition.

  • Deprecated. A deprecated definition can be used, but a warning is logged every time the underlying data is accessed.

  • Archived. An archived definition cannot be used to perform any action. To perform actions on an archived definition, it must be reactivated.

tags

Return the tags associated with the Dataset.

Returns

Dataset tags.

Return type

workspace

If the Dataset was registered in a workspace, return that workspace. Otherwise, return None.

Returns

The workspace.

Return type