TabularDataset Class
Represents a tabular dataset to use in Azure Machine Learning.
A TabularDataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation. Data is not loaded from the source until TabularDataset is asked to deliver data.
TabularDataset is created using methods like from_delimited_files from the TabularDatasetFactory class.
For more information, see the article Add & register datasets. To get started working with a tabular dataset, see https://aka.ms/tabulardataset-samplenotebook.
Initialize a TabularDataset object.
This constructor is not intended to be invoked directly. A TabularDataset is intended to be created using the TabularDatasetFactory class.
- Inheritance
-
TabularDataset
Constructor
TabularDataset()
Remarks
A TabularDataset can be created from CSV, TSV, Parquet files, or SQL query results using the from_* methods of the TabularDatasetFactory class. You can perform subsetting operations on a TabularDataset, such as splitting, skipping, and filtering records. The result of subsetting is always one or more new TabularDataset objects.
You can also convert a TabularDataset into other formats, such as a pandas DataFrame. The actual data loading happens when the TabularDataset is asked to deliver the data into another storage mechanism (for example, a pandas DataFrame or a CSV file).
A TabularDataset can be used as the input of an experiment run. It can also be registered to a workspace with a specified name and be retrieved by that name later.
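A rough sketch of that flow is shown below; the datastore path and dataset name are hypothetical:

```python
from azureml.core import Dataset, Workspace

# Connect to the workspace (assumes a config.json written by the Azure ML SDK/CLI).
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Lazily define a TabularDataset from delimited files; no data is read yet.
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'weather/2018/*.csv'))

# Subsetting operations always return new TabularDataset objects.
preview = dataset.take(5)

# Data is loaded only when delivery is requested, e.g. into a pandas DataFrame.
df = preview.to_pandas_dataframe()

# Optionally register the dataset so it can be retrieved by name later.
dataset = dataset.register(workspace=ws, name='weather-2018')
```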
Methods
| Method | Description |
| --- | --- |
| download | (Experimental; see https://aka.ms/azuremlexperimental.) Download file streams defined by the dataset to a local path. |
| drop_columns | Drop the specified columns from the dataset. If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well. |
| filter | (Experimental; see https://aka.ms/azuremlexperimental.) Filter the data, leaving only the records that match the specified expression. |
| get_profile | (Experimental; see https://aka.ms/azuremlexperimental.) Get the data profile from the latest profile run submitted for this or the same dataset in the workspace. |
| get_profile_runs | (Experimental; see https://aka.ms/azuremlexperimental.) Return previous profile runs associated with this or the same dataset in the workspace. |
| keep_columns | Keep the specified columns and drop all others from the dataset. If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well. |
| mount | (Experimental; see https://aka.ms/azuremlexperimental.) Create a context manager for mounting file streams defined by the dataset as local files. |
| partition_by | Copy the partitioned data to the destination specified by target, create a dataset from the output data path with the partition format, register the dataset if a name is provided, and return the dataset for the new data path with partitions. |
| random_split | Split records in the dataset into two parts randomly and approximately by the percentage specified. The first dataset contains approximately the specified percentage of the total records and the second dataset contains the remaining records. |
| skip | Skip records from the top of the dataset by the specified count. |
| submit_profile_run | (Experimental; see https://aka.ms/azuremlexperimental.) Submit an experimentation run to calculate the data profile. A data profile can be very useful for understanding the input data and identifying anomalies and missing values, by providing information about the data such as column types and missing value counts. |
| take | Take a sample of records from the top of the dataset by the specified count. |
| take_sample | Take a random sample of records in the dataset, approximately by the probability specified. |
| time_after | Filter the TabularDataset with timestamp columns after a specified start time. |
| time_before | Filter the TabularDataset with timestamp columns before a specified end time. |
| time_between | Filter the TabularDataset between a specified start and end time. |
| time_recent | Filter the TabularDataset to contain only the specified duration (amount) of recent data. |
| to_csv_files | Convert the current dataset into a FileDataset containing CSV files. The resulting dataset will contain one or more CSV files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from. |
| to_dask_dataframe | (Experimental; see https://aka.ms/azuremlexperimental.) Return a Dask DataFrame that can lazily read the data in the dataset. |
| to_pandas_dataframe | Load all records from the dataset into a pandas DataFrame. |
| to_parquet_files | Convert the current dataset into a FileDataset containing Parquet files. The resulting dataset will contain one or more Parquet files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from. |
| to_spark_dataframe | Load all records from the dataset into a Spark DataFrame. |
| with_timestamp_columns | Define timestamp columns for the dataset. |
download
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Download file streams defined by the dataset to a local path.
download(stream_column, target_path=None, overwrite=False, ignore_not_found=True)
Parameters
- target_path
- str
The local directory to download the files to. If None, the data will be downloaded into a temporary directory.
- overwrite
- bool
Indicates whether to overwrite existing files. The default is False. Existing files will be overwritten if overwrite is set to True; otherwise an exception will be raised.
- ignore_not_found
- bool
Indicates whether to fail the download if some files pointed to by the dataset are not found. The default is True. If ignore_not_found is set to False, the download will fail if any file download fails for any reason; otherwise a warning will be logged for not-found errors and the download will succeed as long as no other error types are encountered.
Returns
Returns an array of file paths for each file downloaded.
Return type
drop_columns
Drop the specified columns from the dataset.
If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well.
drop_columns(columns)
Parameters
Returns
Returns a new TabularDataset object with the specified columns dropped.
Return type
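For example, assuming dataset is an existing TabularDataset and the column names are hypothetical:

```python
# Returns a new TabularDataset; the original dataset is unchanged.
slim_ds = dataset.drop_columns(['raw_payload', 'debug_flags'])
```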
filter
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Filter the data, leaving only the records that match the specified expression.
filter(expression)
Parameters
Returns
The modified dataset (unregistered).
Return type
Remarks
Expressions are started by indexing the Dataset with the name of a column. They support a variety of functions and operators and can be combined using logical operators. The resulting expression will be lazily evaluated for each record when a data pull occurs and not where it is defined.
dataset['myColumn'] > dataset['columnToCompareAgainst']
dataset['myColumn'].starts_with('prefix')
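A small sketch of how such an expression might be used (the column names are hypothetical, and combining clauses with & is assumed here):

```python
# Keep only rows matching the combined expression; evaluation is deferred.
filtered = dataset.filter(
    (dataset['temperature'] > 0) & (dataset['station'].starts_with('KSEA'))
)
df = filtered.to_pandas_dataframe()  # data is pulled (and filtered) here
```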
get_profile
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Get the data profile from the latest profile run submitted for this or the same dataset in the workspace.
get_profile(workspace=None)
Parameters
- workspace
- Workspace
The workspace where the profile run was submitted. Defaults to the workspace of this dataset. Required if the dataset is not associated with a workspace. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace for more information on workspaces.
Returns
Profile result from the latest profile run of type DatasetProfile.
Return type
get_profile_runs
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Return previous profile runs associated with this or the same dataset in the workspace.
get_profile_runs(workspace=None)
Parameters
- workspace
- Workspace
The workspace where the profile run was submitted. Defaults to the workspace of this dataset. Required if the dataset is not associated with a workspace. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace for more information on workspaces.
Returns
iterator object of type azureml.core.Run.
Return type
keep_columns
Keep the specified columns and drop all others from the dataset.
If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well.
keep_columns(columns, validate=False)
Parameters
- validate
- bool
Indicates whether to validate if data can be loaded from the returned dataset. The default is False. Validation requires that the data source is accessible from current compute.
Returns
Returns a new TabularDataset object with only the specified columns kept.
Return type
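For example, with hypothetical column names:

```python
# Keep only the columns needed downstream; validate=True checks that data can load.
features = dataset.keep_columns(['timestamp', 'temperature', 'humidity'], validate=True)
```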
mount
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Create a context manager for mounting file streams defined by the dataset as local files.
mount(stream_column, mount_point=None)
Parameters
- mount_point
- str
The local directory to mount the files to. If None, the data will be mounted into a temporary directory, which you can find by calling the MountContext.mount_point instance method.
Returns
Returns a context manager for managing the lifecycle of the mount.
Return type
partition_by
Partitioned data will be copied and output to the destination specified by target.
The method creates a dataset from the output data path with the partition format, registers the dataset if a name is provided, and returns the dataset for the new data path with partitions.
ds = Dataset.get_by_name('test') # indexed by country, state, partition_date
# #1: call partition_by locally
new_ds = ds.partition_by(name="repartitioned_ds", partition_keys=['country'],
target=DataPath(datastore, "repartition"))
partition_keys = new_ds.partition_keys # ['country']
# new_ds can be passed to PRS as input dataset
partition_by(partition_keys, target, name=None, show_progress=True, partition_as_file_dataset=False)
Parameters
- target
Required; the datastore path where the dataframe's Parquet data will be uploaded. A GUID folder will be generated under the target path to avoid conflict.
- show_progress
- bool
Optional. Indicates whether to show the progress of the upload in the console. Defaults to True.
- partition_as_file_dataset
Optional. Indicates whether to return a FileDataset or not. Defaults to False.
Returns
The saved or registered dataset.
Return type
random_split
Split records in the dataset into two parts randomly and approximately by the percentage specified.
The first dataset contains approximately percentage of the total records and the second dataset contains the remaining records.
random_split(percentage, seed=None)
Parameters
- percentage
- float
The approximate percentage to split the dataset by. This must be a number between 0.0 and 1.0.
Returns
Returns a tuple of new TabularDataset objects representing the two datasets after the split.
Return type
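A typical train/test split might look like this sketch:

```python
# Roughly 80% of records in train_ds, the rest in test_ds; seed makes it repeatable.
train_ds, test_ds = dataset.random_split(percentage=0.8, seed=42)
```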
skip
Skip records from the top of the dataset by the specified count.
skip(count)
Parameters
Returns
Returns a new TabularDataset object representing a dataset with records skipped.
Return type
submit_profile_run
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Submit an experimentation run to calculate the data profile.
A data profile can be very useful for understanding the input data and identifying anomalies and missing values, by providing information about the data such as column types and missing value counts.
submit_profile_run(compute_target, experiment, cache_datastore_name=None)
Parameters
- compute_target
- Union[str, ComputeTarget]
The compute target to run the profile calculation experiment on. Specify 'local' to use local compute. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.computetarget for more information on compute targets.
- experiment
- Experiment
The experiment object. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.experiment for more information on experiments.
- cache_datastore_name
- str
The name of the datastore to store the profile cache. If None, the default datastore will be used.
Returns
An object of the DatasetProfileRun class.
Return type
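A minimal sketch, assuming an existing workspace and a hypothetical experiment name:

```python
from azureml.core import Experiment, Workspace

ws = Workspace.from_config()
experiment = Experiment(ws, 'dataset-profiling')  # hypothetical experiment name

# Run the profile calculation on local compute (a remote ComputeTarget also works).
profile_run = dataset.submit_profile_run(compute_target='local', experiment=experiment)

# After the run completes, the latest profile can be retrieved from the workspace.
profile = dataset.get_profile(workspace=ws)
```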
take
Take a sample of records from the top of the dataset by the specified count.
take(count)
Parameters
Returns
Returns a new TabularDataset object representing the sampled dataset.
Return type
take_sample
Take a random sample of records in the dataset approximately by the probability specified.
take_sample(probability, seed=None)
Parameters
Returns
Returns a new TabularDataset object representing the sampled dataset.
Return type
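For example, combining take, skip, and take_sample on an existing dataset:

```python
head = dataset.take(1000)                               # first 1,000 records
tail = dataset.skip(1000)                               # everything after the first 1,000
sample = dataset.take_sample(probability=0.1, seed=7)   # roughly 10% of all records
```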
time_after
Filter TabularDataset with time stamp columns after a specified start time.
time_after(start_time, include_boundary=True, validate=True)
Parameters
- include_boundary
- bool
Indicates whether the row associated with the boundary time (start_time) should be included.
- validate
- bool
Indicates whether to validate if specified columns exist in dataset. The default is True. Validation requires that the data source is accessible from the current compute.
Returns
A TabularDataset with the new filtered dataset.
Return type
time_before
Filter TabularDataset with time stamp columns before a specified end time.
time_before(end_time, include_boundary=True, validate=True)
Parameters
- include_boundary
- bool
Indicates whether the row associated with the boundary time (end_time) should be included.
- validate
- bool
Indicates whether to validate if specified columns exist in dataset. The default is True. Validation requires that the data source is accessible from the current compute.
Returns
A TabularDataset with the new filtered dataset.
Return type
time_between
Filter TabularDataset between a specified start and end time.
time_between(start_time, end_time, include_boundary=True, validate=True)
Parameters
- include_boundary
- bool
Indicates whether the rows associated with the boundary times (start_time and end_time) should be included.
- validate
- bool
Indicates whether to validate if specified columns exist in dataset. The default is True. Validation requires that the data source is accessible from the current compute.
Returns
A TabularDataset with the new filtered dataset.
Return type
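For example, assuming the dataset already has a timestamp column defined via with_timestamp_columns:

```python
from datetime import datetime

# Keep only records stamped within January 2020, including the boundary rows.
jan_ds = dataset.time_between(
    start_time=datetime(2020, 1, 1),
    end_time=datetime(2020, 1, 31),
    include_boundary=True,
)
```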
time_recent
Filter TabularDataset to contain only the specified duration (amount) of recent data.
time_recent(time_delta, include_boundary=True, validate=True)
Parameters
- include_boundary
- bool
Indicates whether the row associated with the boundary time (time_delta) should be included.
- validate
- bool
Indicates whether to validate if specified columns exist in dataset. The default is True. Validation requires that the data source is accessible from the current compute.
Returns
A TabularDataset with the new filtered dataset.
Return type
to_csv_files
Convert the current dataset into a FileDataset containing CSV files.
The resulting dataset will contain one or more CSV files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.
to_csv_files(separator=',')
Parameters
Returns
Returns a new FileDataset object with a set of CSV files containing the data in this dataset.
Return type
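A short sketch of converting and then materializing the files (the target path is hypothetical):

```python
# Convert to a FileDataset of CSV files; nothing is written yet.
csv_file_ds = dataset.to_csv_files(separator=',')

# Files are materialized only when downloaded (or mounted / read from).
local_paths = csv_file_ds.download(target_path='./csv_out', overwrite=True)
```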
to_dask_dataframe
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Return a Dask DataFrame that can lazily read the data in the dataset.
to_dask_dataframe(sample_size=10000, dtypes=None, on_error='null', out_of_range_datetime='null')
Parameters
- sample_size
The number of records to read to determine schema and types.
- dtypes
An optional dict specifying the expected columns and their dtypes. sample_size is ignored if this is provided.
- on_error
How to handle any error values in the dataset, such as those produced by an error while parsing values. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.
- out_of_range_datetime
How to handle date-time values that are outside the range supported by Pandas. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.
Returns
dask.dataframe.core.DataFrame
to_pandas_dataframe
Load all records from the dataset into a pandas DataFrame.
to_pandas_dataframe(on_error='null', out_of_range_datetime='null')
Parameters
- on_error
How to handle any error values in the dataset, such as those produced by an error while parsing values. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.
- out_of_range_datetime
How to handle date-time values that are outside the range supported by Pandas. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.
Returns
Returns a pandas DataFrame.
Return type
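For example:

```python
# Replace unparsable values and out-of-range datetimes with null instead of failing.
df = dataset.to_pandas_dataframe(on_error='null', out_of_range_datetime='null')
print(df.dtypes)
```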
to_parquet_files
Convert the current dataset into a FileDataset containing Parquet files.
The resulting dataset will contain one or more Parquet files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.
to_parquet_files()
Returns
Returns a new FileDataset object with a set of Parquet files containing the data in this dataset.
Return type
to_spark_dataframe
Load all records from the dataset into a Spark DataFrame.
to_spark_dataframe()
Returns
Returns a Spark DataFrame.
Return type
with_timestamp_columns
Define timestamp columns for the dataset.
with_timestamp_columns(timestamp=None, partition_timestamp=None, validate=False, **kwargs)
Parameters
- timestamp
- str
The name of the column to use as the timestamp (formerly referred to as fine_grain_timestamp) (optional). The default is None (clear).
- partition_timestamp
- str
The name of the column to use as partition_timestamp (formerly referred to as the coarse grain timestamp) (optional). The default is None (clear).
- validate
- bool
Indicates whether to validate if specified columns exist in dataset. The default is False. Validation requires that the data source is accessible from the current compute.
Returns
Returns a new TabularDataset with timestamp columns defined.
Return type
Remarks
This method defines the columns to be used as timestamps. Timestamp columns on a dataset make it possible to treat the data as time-series data and enable additional capabilities. When a dataset has both timestamp (formerly referred to as fine_grain_timestamp) and partition_timestamp (formerly referred to as coarse grain timestamp) specified, the two columns should represent the same timeline.
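For example, assuming the dataset has a column named 'datetime' (a hypothetical name):

```python
from datetime import timedelta

# Assign the timestamp column; this enables the time_* filters shown above.
ts_dataset = dataset.with_timestamp_columns(timestamp='datetime', validate=True)

# Keep only the last 7 days of data.
recent = ts_dataset.time_recent(timedelta(days=7))
```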
Attributes
timestamp_columns