data Package

Contains modules supporting data representation for Datastore and Dataset in Azure Machine Learning.

This package contains core functionality supporting the Datastore and Dataset classes in the core package. Datastore objects contain connection information to Azure storage services, so you can refer to a storage service by name instead of working directly with or hard-coding connection information in scripts. Datastore supports a number of different services represented by classes in this package, including AzureBlobDatastore, AzureFileDatastore, and AzureDataLakeDatastore. For a full list of supported storage services, see the Datastore class.
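As a minimal sketch of referring to a datastore by name, the helper below wraps the Datastore.get method from azureml.core. The helper name get_datastore_by_name is hypothetical and exists only for illustration; a workspace with an already registered datastore is assumed.

```python
def get_datastore_by_name(workspace, datastore_name):
    """Return the datastore registered under `datastore_name` in `workspace`.

    Hypothetical helper for illustration; it only wraps Datastore.get.
    """
    # Imported inside the function so this snippet loads without the SDK installed.
    from azureml.core import Datastore

    # No connection string or key appears here; the workspace stores them.
    return Datastore.get(workspace, datastore_name)
```

With a workspace obtained from Workspace.from_config(), calling get_datastore_by_name(workspace, 'example_blob_datastore') would look up a datastore registered under that (assumed) name.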

While a Datastore acts as a container for your data files, you can think of a Dataset as a reference or pointer to specific data that's in your datastore. The following Dataset types are supported:

  • TabularDataset represents data in a tabular format created by parsing the provided file or list of files.

  • FileDataset references single or multiple files in your datastores or public URLs.

For more information, see the article Add & register datasets. To get started working with datasets, see https://aka.ms/tabulardataset-samplenotebook and https://aka.ms/filedataset-samplenotebook.
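The two dataset types can be created from paths on a datastore roughly as follows. This is a sketch, assuming an existing datastore object; the function names and the example paths in the usage note are hypothetical, while Dataset.Tabular.from_delimited_files and Dataset.File.from_files are the factory methods exposed through azureml.core.Dataset.

```python
def create_tabular_dataset(datastore, relative_path):
    """Parse delimited files at `relative_path` on `datastore` into a TabularDataset."""
    # Imported inside the function so this snippet loads without the SDK installed.
    from azureml.core import Dataset

    # A (datastore, path) tuple points at data stored in the datastore.
    return Dataset.Tabular.from_delimited_files(path=(datastore, relative_path))


def create_file_dataset(datastore, relative_glob):
    """Reference the files matching `relative_glob` on `datastore` as a FileDataset."""
    from azureml.core import Dataset

    return Dataset.File.from_files(path=(datastore, relative_glob))
```

For example, create_tabular_dataset(datastore, 'weather/2018/*.csv') and create_file_dataset(datastore, 'images/**') would each return a dataset that only references the data; nothing is copied at creation time.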

Modules

abstract_dataset

Contains the abstract base class for datasets in Azure Machine Learning.

abstract_datastore

Contains the base functionality for datastores that save connection information to Azure storage services.

azure_data_lake_datastore

Contains the base functionality for datastores that save connection information to Azure Data Lake Storage.

azure_my_sql_datastore

Contains the base functionality for datastores that save connection information to Azure Database for MySQL.

azure_postgre_sql_datastore

Contains the base functionality for datastores that save connection information to Azure Database for PostgreSQL.

azure_sql_database_datastore

Contains the base functionality for datastores that save connection information to Azure SQL database.

azure_storage_datastore

Contains functionality for datastores that save connection information to Azure Blob and Azure File storage.

constants

Constants used in the azureml.data package. Internal use only.

context_managers

Contains functionality to manage data context of datastores and datasets. Internal use only.

data_reference

Contains functionality that defines how to create references to data in datastores.

datacache

Contains functionality for managing DatacacheStore and Datacache in Azure Machine Learning.

datacache_client

Internal use only.

datacache_consumption_config

Contains functionality for DataCache consumption configuration.

datapath

Contains functionality to create references to data in datastores.

This module contains the DataPath class, which represents the location of data, and the DataPathComputeBinding class, which represents how the data is made available on the compute targets.
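A minimal sketch of how the two classes fit together, assuming an existing datastore object: DataPath names the location of the data, and DataPathComputeBinding states how it should be made available on the compute target. The function name build_data_path is hypothetical, and 'mount' is used here only as an example mode.

```python
def build_data_path(datastore, path_on_datastore):
    """Return a (DataPath, DataPathComputeBinding) pair for data on `datastore`."""
    # Imported inside the function so this snippet loads without the SDK installed.
    from azureml.data.datapath import DataPath, DataPathComputeBinding

    # Where the data lives: a datastore plus a path relative to its root.
    data_path = DataPath(datastore=datastore, path_on_datastore=path_on_datastore)

    # How the data reaches the compute target; 'mount' is an illustrative choice.
    binding = DataPathComputeBinding(mode='mount')
    return data_path, binding
```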

dataset_action_run

Contains functionality that manages the execution of Dataset actions.

This module provides convenience methods for creating Dataset actions and getting their results after completion.

dataset_consumption_config

Contains functionality for Dataset consumption configuration.

dataset_definition

Contains functionality to manage dataset definition and its operations.

Note

This module is deprecated. For more information, see https://aka.ms/dataset-deprecation.

dataset_error_handling

Contains exceptions for dataset error handling in Azure Machine Learning.

dataset_factory

Contains functionality to create datasets for Azure Machine Learning.

dataset_profile

Contains the class for collecting summary statistics on the data produced by a Dataflow.

Functionality in this module includes collecting information about which run produced the profile and whether the profile is stale.

dataset_profile_run

Contains configuration for monitoring a dataset profile run in Azure Machine Learning.

Functionality in this module includes handling and monitoring a dataset profile run associated with an experiment object and an individual run ID.

dataset_profile_run_config

Contains configuration to generate a statistics summary of datasets in Azure Machine Learning.

Functionality in this module includes methods for submitting a local or remote profile run and visualizing the result of the submitted profile run.

dataset_snapshot

Contains functionality to manage Dataset snapshot operations.

Note

This module is deprecated. For more information, see https://aka.ms/dataset-deprecation.

dataset_type_definitions

Contains enumeration values used with Dataset.

datastore_client

Internal use only.

dbfs_datastore

Contains functionality for datastores that save connection information to Databricks File System (DBFS).

file_dataset

Contains functionality for referencing single or multiple files in datastores or public URLs.

For more information, see the article Add & register datasets. To get started working with a file dataset, see https://aka.ms/filedataset-samplenotebook.

output_dataset_config

Contains configurations that specify how outputs for a job should be uploaded and promoted to a dataset.

For more information, see the article how to specify outputs.

sql_data_reference

Contains functionality for creating references to data in datastores that save connection info to SQL databases.

stored_procedure_parameter

Contains functionality for creating a parameter to pass to a SQL stored procedure.

tabular_dataset

Contains functionality for representing data in a tabular format by parsing the provided file or list of files.

For more information, see the article Add & register datasets. To get started working with a tabular dataset, see https://aka.ms/tabulardataset-samplenotebook.

Classes

DataType

Configures column data types for a dataset created in Azure Machine Learning.

DataType methods are used in the TabularDatasetFactory class from_* methods, which are used to create new TabularDataset objects.
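As a sketch of how DataType values are passed to a from_* method, the function below overrides the inferred column types through the set_column_types argument of from_delimited_files. The function name and the column names ('id', 'price', 'name', 'created') are hypothetical; an existing datastore object is assumed.

```python
def load_with_column_types(datastore, relative_path):
    """Create a TabularDataset with explicit column types (illustrative columns)."""
    # Imported inside the function so this snippet loads without the SDK installed.
    from azureml.core import Dataset
    from azureml.data.dataset_factory import DataType

    # Hypothetical column names mapped to the DataType each should be parsed as.
    column_types = {
        'id': DataType.to_long(),
        'price': DataType.to_float(),
        'name': DataType.to_string(),
        'created': DataType.to_datetime('%Y-%m-%d'),
    }

    return Dataset.Tabular.from_delimited_files(
        path=(datastore, relative_path),
        set_column_types=column_types,
    )
```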

DatacacheStore

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Represents a storage abstraction over an Azure Machine Learning storage account.

DatacacheStores are attached to workspaces and are used to store information related to the underlying datacache solution. Currently, only the partitioned blob solution is supported. A DatacacheStore defines the Blob datastores that can be used for caching.

Use this class to perform management operations, including register, list, get, and update datacachestores. DatacacheStores for each service are created with the register* methods of this class.

FileDataset

Represents a collection of file references in datastores or public URLs to use in Azure Machine Learning.

A FileDataset defines a series of lazily-evaluated, immutable operations to load data from the data source into file streams. Data is not loaded from the source until FileDataset is asked to deliver data.
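The idea of lazily evaluated, immutable operations can be illustrated with a small toy model in plain Python. This is not the SDK's implementation; the class LazyRecords and all of its methods are hypothetical and exist only to show that each operation returns a new definition and that the source is read only when data is requested.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class LazyRecords:
    """Toy stand-in for a lazily evaluated, immutable dataset definition."""
    source: Callable[[], Iterable[dict]]
    steps: Tuple[Callable[[Iterator[dict]], Iterator[dict]], ...] = ()

    def _with_step(self, step):
        # Return a new object; the original definition is never modified.
        return LazyRecords(self.source, self.steps + (step,))

    def filter(self, predicate):
        return self._with_step(lambda rows: (r for r in rows if predicate(r)))

    def take(self, count):
        return self._with_step(lambda rows: islice(rows, count))

    def to_list(self):
        # Only at this point is the source actually read.
        rows = iter(self.source())
        for step in self.steps:
            rows = step(rows)
        return list(rows)


reads = []  # records every time the source is read


def read_source():
    reads.append(1)
    return [{"x": 1}, {"x": 2}, {"x": 3}]


base = LazyRecords(read_source)
small = base.filter(lambda r: r["x"] > 1).take(1)
reads_before = len(reads)   # still zero: defining operations reads nothing
result = small.to_list()    # the source is read here, exactly once
```

Because every operation returns a new object, base keeps its empty step list and can be reused to build other definitions.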

A FileDataset is created using the from_files method of the FileDatasetFactory class.

For more information, see the article Add & register datasets. To get started working with a file dataset, see https://aka.ms/filedataset-samplenotebook.

HDFSOutputDatasetConfig

Represents how to output to an HDFS path and promote the output as a FileDataset.

LinkFileOutputDatasetConfig

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

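Represents how to link the output of a run and promote it as a FileDataset.

The LinkFileOutputDatasetConfig allows you to link a file dataset as an output dataset:

   from azureml.core import Experiment, ScriptRunConfig, Workspace
   from azureml.data import LinkFileOutputDatasetConfig

   workspace = Workspace.from_config()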
   experiment = Experiment(workspace, 'output_example')

   output = LinkFileOutputDatasetConfig('link_output')

   script_run_config = ScriptRunConfig('.', 'link.py', arguments=[output])

   # within link.py
   # from azureml.core import Run, Dataset
   # run = Run.get_context()
   # workspace = run.experiment.workspace
   # dataset = Dataset.get_by_name(workspace, name='dataset_to_link')
   # run.output_datasets['link_output'].link(dataset)

   run = experiment.submit(script_run_config)
   print(run)

LinkTabularOutputDatasetConfig

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

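Represents how to link the output of a run and promote it as a TabularDataset.

The LinkTabularOutputDatasetConfig allows you to link a tabular dataset as an output dataset:

   from azureml.core import Experiment, ScriptRunConfig, Workspace
   from azureml.data import LinkTabularOutputDatasetConfig

   workspace = Workspace.from_config()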
   experiment = Experiment(workspace, 'output_example')

   output = LinkTabularOutputDatasetConfig('link_output')

   script_run_config = ScriptRunConfig('.', 'link.py', arguments=[output])

   # within link.py
   # from azureml.core import Run, Dataset
   # run = Run.get_context()
   # workspace = run.experiment.workspace
   # dataset = Dataset.get_by_name(workspace, name='dataset_to_link')
   # run.output_datasets['link_output'].link(dataset)

   run = experiment.submit(script_run_config)
   print(run)

OutputFileDatasetConfig

Represents how to copy the output of a run and promote it as a FileDataset.

The OutputFileDatasetConfig allows you to specify how a particular local path on the compute target is uploaded to the specified destination. If no arguments are passed to the constructor, a name, a destination, and a local path are generated automatically.

An example of not passing any arguments:


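   from azureml.core import Experiment, ScriptRunConfig, Workspace
   from azureml.data import OutputFileDatasetConfig

   workspace = Workspace.from_config()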
   experiment = Experiment(workspace, 'output_example')

   output = OutputFileDatasetConfig()

   script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])

   run = experiment.submit(script_run_config)
   print(run)

An example of creating an output, promoting it to a tabular dataset, and registering it with the name foo:


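   from azureml.core import Datastore, Experiment, ScriptRunConfig, Workspace
   from azureml.data import OutputFileDatasetConfig

   workspace = Workspace.from_config()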
   experiment = Experiment(workspace, 'output_example')

   datastore = Datastore(workspace, 'example_adls_gen2_datastore')

   # The destination path 'outputs/foo' is an illustrative placeholder.
   # See the OutputFileDatasetConfig documentation for details on the parameters and methods.
   output = OutputFileDatasetConfig(destination=(datastore, 'outputs/foo')).read_delimited_files().register_on_complete('foo')

   script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])

   run = experiment.submit(script_run_config)
   print(run)

TabularDataset

Represents a tabular dataset to use in Azure Machine Learning.

A TabularDataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation. Data is not loaded from the source until TabularDataset is asked to deliver data.

A TabularDataset is created using methods such as from_delimited_files from the TabularDatasetFactory class.

For more information, see the article Add & register datasets. To get started working with a tabular dataset, see https://aka.ms/tabulardataset-samplenotebook.
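As a brief sketch of when data is actually loaded, the helper below previews the first rows of an existing TabularDataset. The helper name preview_rows is hypothetical; take and to_pandas_dataframe are TabularDataset methods, and only the latter causes data to be read.

```python
def preview_rows(tabular_dataset, count=5):
    """Return the first `count` rows of `tabular_dataset` as a pandas DataFrame."""
    # take() only returns a new dataset definition; nothing is loaded yet.
    limited = tabular_dataset.take(count)

    # to_pandas_dataframe() is the point at which the data is delivered.
    return limited.to_pandas_dataframe()
```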