TabularDataset Class

Represents a tabular dataset to use in Azure Machine Learning.

A TabularDataset defines a series of lazily evaluated, immutable operations to load data from the data source into a tabular representation. Data is not loaded from the source until TabularDataset is asked to deliver data.

A TabularDataset is created using methods like from_delimited_files of the TabularDatasetFactory class.

For more information, see the article Add & register datasets. To get started working with a tabular dataset, see https://aka.ms/tabulardataset-samplenotebook.

Inheritance
TabularDataset

Constructor

TabularDataset()

Remarks

A TabularDataset can be created from CSV, TSV, or Parquet files, or from a SQL query, using the from_* methods of the TabularDatasetFactory class. You can perform subsetting operations on a TabularDataset, such as splitting, skipping, and filtering records. The result of subsetting is always one or more new TabularDataset objects.

You can also convert a TabularDataset into other formats, such as a pandas DataFrame. The actual data loading happens when TabularDataset is asked to deliver the data into another storage mechanism (e.g., a pandas DataFrame or a CSV file).

A TabularDataset can be used as the input of an experiment run. It can also be registered to a workspace under a specified name and retrieved by that name later.
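
For example, the following minimal sketch creates a TabularDataset from delimited files, registers it, and retrieves it by name. The datastore path 'weather/*.csv' and the dataset name 'weather-ds' are hypothetical.

from azureml.core import Workspace, Dataset

# Connect to an existing workspace (assumes a config.json written by the SDK).
ws = Workspace.from_config()

# Create a TabularDataset from CSV files in the default datastore.
datastore = ws.get_default_datastore()
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'weather/*.csv'))

# Register the dataset so it can be retrieved by name later.
dataset = dataset.register(workspace=ws, name='weather-ds')
retrieved = Dataset.get_by_name(ws, name='weather-ds')

# Data is loaded only when delivery is requested, e.g., as a pandas DataFrame.
df = retrieved.to_pandas_dataframe()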

Methods

drop_columns

Drop the specified columns from the dataset.

If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well.

get_profile

Note

This is an experimental method, and may change at any time.
For more information, see https://aka.ms/azuremlexperimental.

Get the data profile from the latest profile run submitted for this or the same dataset in the workspace.

get_profile_runs

Note

This is an experimental method, and may change at any time.
For more information, see https://aka.ms/azuremlexperimental.

Return previous profile runs associated with this or the same dataset in the workspace.

keep_columns

Keep the specified columns and drop all others from the dataset.

If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well.

random_split

Randomly split records in the dataset into two parts, approximately by the percentage specified.

The first dataset contains approximately the specified percentage of the total records, and the second dataset contains the remaining records.

skip

Skip records from the top of the dataset by the specified count.

submit_profile_run

Note

This is an experimental method, and may change at any time.
For more information, see https://aka.ms/azuremlexperimental.

Submit an experimentation run to calculate data profile.

A data profile can be very useful for understanding the input data and identifying anomalies and missing values, because it provides information about the data such as column types and missing-value counts.

take

Take a sample of records from the top of the dataset by the specified count.

take_sample

Take a random sample of records in the dataset, approximately by the probability specified.

time_after

Filter a TabularDataset with timestamp columns, keeping rows after a specified start time.

time_before

Filter a TabularDataset with timestamp columns, keeping rows before a specified end time.

time_between

Filter a TabularDataset, keeping rows between a specified start and end time.

time_recent

Filter a TabularDataset to contain only the specified duration (amount) of recent data.

to_csv_files

Convert the current dataset into a FileDataset containing CSV files.

The resulting dataset will contain one or more CSV files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.

to_pandas_dataframe

Load all records from the dataset into a pandas DataFrame.

to_parquet_files

Convert the current dataset into a FileDataset containing Parquet files.

The resulting dataset will contain one or more Parquet files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.

to_spark_dataframe

Load all records from the dataset into a Spark DataFrame.

with_timestamp_columns

Define timestamp columns for the dataset.

drop_columns

Drop the specified columns from the dataset.

If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well.

drop_columns(columns)

Parameters

columns
str or list[str]

The name or a list of names for the columns to drop.

Returns

Returns a new TabularDataset object with the specified columns dropped.

Return type

TabularDataset

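A minimal sketch, where dataset is a TabularDataset and 'id' and 'raw_payload' are hypothetical column names:

# drop_columns returns a new TabularDataset; the original dataset is unchanged.
slim_dataset = dataset.drop_columns(['id', 'raw_payload'])
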
get_profile

Note

This is an experimental method, and may change at any time.
For more information, see https://aka.ms/azuremlexperimental.

Get the data profile from the latest profile run submitted for this or the same dataset in the workspace.

get_profile(workspace=None)

Parameters

workspace
Workspace

The workspace where the profile run was submitted. Defaults to the workspace of this dataset. Required if the dataset is not associated with a workspace. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace for more information on workspaces.

Returns

The profile result from the latest profile run, of type DatasetProfile.

Return type

DatasetProfile

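A minimal sketch, assuming a profile run has already been submitted for the dataset (see submit_profile_run):

# Fetch the profile computed by the most recent profile run.
profile = dataset.get_profile()
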
get_profile_runs

Note

This is an experimental method, and may change at any time.
For more information, see https://aka.ms/azuremlexperimental.

Return previous profile runs associated with this or the same dataset in the workspace.

get_profile_runs(workspace=None)

Parameters

workspace
Workspace

The workspace where the profile run was submitted. Defaults to the workspace of this dataset. Required if the dataset is not associated with a workspace. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace for more information on workspaces.

Returns

An iterator of objects of type azureml.core.Run.

Return type

iterator

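A minimal sketch iterating over past profile runs:

# Each item is an azureml.core.Run for a past profile calculation.
for run in dataset.get_profile_runs():
    print(run.id, run.get_status())
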
keep_columns

Keep the specified columns and drop all others from the dataset.

If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well.

keep_columns(columns, validate=False)

Parameters

columns
str or list[str]

The name or a list of names for the columns to keep.

validate
bool

Indicates whether to validate if data can be loaded from the returned dataset. The default is False. Validation requires that the data source is accessible from the current compute.

Returns

Returns a new TabularDataset object with only the specified columns kept.

Return type

TabularDataset

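A minimal sketch; 'temperature' and 'humidity' are hypothetical column names:

# Keep only the listed columns, checking that the result can be loaded.
feature_dataset = dataset.keep_columns(['temperature', 'humidity'], validate=True)
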
random_split

Randomly split records in the dataset into two parts, approximately by the percentage specified.

The first dataset contains approximately the specified percentage of the total records, and the second dataset contains the remaining records.

random_split(percentage, seed=None)

Parameters

percentage
float

The approximate percentage to split the dataset by. This must be a number between 0.0 and 1.0.

seed
int

Optional seed to use for the random generator.

Returns

Returns a tuple of new TabularDataset objects representing the two datasets after the split.

Return type

(TabularDataset, TabularDataset)

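For example, a reproducible 80/20 train/test split:

# A fixed seed makes the split reproducible across runs.
train_dataset, test_dataset = dataset.random_split(percentage=0.8, seed=123)
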
skip

Skip records from the top of the dataset by the specified count.

skip(count)

Parameters

count
int

The number of records to skip.

Returns

Returns a new TabularDataset object representing a dataset with records skipped.

Return type

TabularDataset

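A minimal sketch:

# Produce a new dataset that omits the first 100 records.
skipped_dataset = dataset.skip(100)
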
submit_profile_run

Note

This is an experimental method, and may change at any time.
For more information, see https://aka.ms/azuremlexperimental.

Submit an experimentation run to calculate data profile.

A data profile can be very useful for understanding the input data and identifying anomalies and missing values, because it provides information about the data such as column types and missing-value counts.

submit_profile_run(compute_target, experiment)

Parameters

compute_target
str or ComputeTarget

The compute target to run the profile calculation experiment on. Specify 'local' to use local compute. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.computetarget for more information on compute targets.

experiment
Experiment

The experiment object. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.experiment for more information on experiments.

Returns

An object of type DatasetProfileRun.

Return type

DatasetProfileRun

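A minimal sketch that runs the profile calculation on local compute; ws is a Workspace object and 'dataset-profiling' is a hypothetical experiment name:

from azureml.core import Experiment

experiment = Experiment(ws, 'dataset-profiling')

# Submit the profile calculation; the result can later be read with get_profile.
profile_run = dataset.submit_profile_run(compute_target='local', experiment=experiment)
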
take

Take a sample of records from the top of the dataset by the specified count.

take(count)

Parameters

count
int

The number of records to take.

Returns

Returns a new TabularDataset object representing the sampled dataset.

Return type

TabularDataset

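For example, previewing the first few records:

# Take the first 10 records and load only those into pandas.
preview_df = dataset.take(10).to_pandas_dataframe()
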
take_sample

Take a random sample of records in the dataset, approximately by the probability specified.

take_sample(probability, seed=None)

Parameters

probability
float

The probability of a record being included in the sample.

seed
int

Optional seed to use for the random generator.

Returns

Returns a new TabularDataset object representing the sampled dataset.

Return type

TabularDataset

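A minimal sketch:

# Sample roughly 5% of the records; a seed makes the sample reproducible.
sample_dataset = dataset.take_sample(probability=0.05, seed=123)
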
time_after

Filter a TabularDataset with timestamp columns, keeping rows after a specified start time.

time_after(start_time, include_boundary=True, validate=True)

Parameters

start_time
datetime

The lower bound for filtering data.

include_boundary
bool

Indicates whether rows associated with the boundary time (start_time) should be included. The default is True.

validate
bool

Indicates whether to validate that the specified columns exist in the dataset. The default is True. Validation requires that the data source is accessible from the current compute.

Returns

A new TabularDataset containing the filtered data.

Return type

TabularDataset

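A minimal sketch, assuming a hypothetical 'datetime' column has been defined as the dataset's timestamp column (see with_timestamp_columns):

import datetime

# Define the timestamp column, then keep rows on or after January 1, 2020.
ts_dataset = dataset.with_timestamp_columns(timestamp='datetime')
after_dataset = ts_dataset.time_after(datetime.datetime(2020, 1, 1))
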
time_before

Filter a TabularDataset with timestamp columns, keeping rows before a specified end time.

time_before(end_time, include_boundary=True, validate=True)

Parameters

end_time
datetime

The upper bound for filtering data.

include_boundary
bool

Indicates whether rows associated with the boundary time (end_time) should be included. The default is True.

validate
bool

Indicates whether to validate that the specified columns exist in the dataset. The default is True. Validation requires that the data source is accessible from the current compute.

Returns

A new TabularDataset containing the filtered data.

Return type

TabularDataset

time_between

Filter a TabularDataset, keeping rows between a specified start and end time.

time_between(start_time, end_time, include_boundary=True, validate=True)

Parameters

start_time
datetime

The lower bound for filtering data.

end_time
datetime

The upper bound for filtering data.

include_boundary
bool

Indicates whether rows associated with the boundary times (start_time and end_time) should be included. The default is True.

validate
bool

Indicates whether to validate that the specified columns exist in the dataset. The default is True. Validation requires that the data source is accessible from the current compute.

Returns

A new TabularDataset containing the filtered data.

Return type

TabularDataset

time_recent

Filter a TabularDataset to contain only the specified duration (amount) of recent data.

time_recent(time_delta, include_boundary=True, validate=True)

Parameters

time_delta
timedelta

The duration (amount) of recent data to retrieve.

include_boundary
bool

Indicates whether rows associated with the boundary time should be included. The default is True.

validate
bool

Indicates whether to validate that the specified columns exist in the dataset. The default is True. Validation requires that the data source is accessible from the current compute.

Returns

A new TabularDataset containing the filtered data.

Return type

TabularDataset

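Continuing the time_after sketch above, the other time filters follow the same pattern:

import datetime

# Boundaries are included by default (include_boundary=True).
before_dataset = ts_dataset.time_before(datetime.datetime(2020, 6, 30))
window_dataset = ts_dataset.time_between(datetime.datetime(2020, 1, 1),
                                         datetime.datetime(2020, 6, 30))
recent_dataset = ts_dataset.time_recent(datetime.timedelta(days=7))
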
to_csv_files

Convert the current dataset into a FileDataset containing CSV files.

The resulting dataset will contain one or more CSV files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.

to_csv_files(separator=',')

Parameters

separator
str

The separator to use to separate values in the resulting files.

Returns

Returns a new FileDataset object with a set of CSV files containing the data in this dataset.

Return type

FileDataset

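A minimal sketch; './output' is a hypothetical local path:

# Convert to a FileDataset of CSV files; the files materialize on download.
csv_files = dataset.to_csv_files(separator=',')
csv_files.download(target_path='./output')
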
to_pandas_dataframe

Load all records from the dataset into a pandas DataFrame.

to_pandas_dataframe(on_error='null', out_of_range_datetime='null')

Parameters

on_error

How to handle any error values in the dataset, such as those produced by an error while parsing values. Valid values are 'null', which replaces them with null, and 'fail', which will result in an exception.

out_of_range_datetime

How to handle date-time values that are outside the range supported by pandas. Valid values are 'null', which replaces them with null, and 'fail', which will result in an exception.

Returns

Returns a pandas DataFrame.

Return type

pandas.DataFrame

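For example, to fail fast instead of silently replacing bad values with null:

# Raise an exception on parse errors or out-of-range datetimes.
df = dataset.to_pandas_dataframe(on_error='fail', out_of_range_datetime='fail')
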
to_parquet_files

Convert the current dataset into a FileDataset containing Parquet files.

The resulting dataset will contain one or more Parquet files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.

to_parquet_files()

Returns

Returns a new FileDataset object with a set of Parquet files containing the data in this dataset.

Return type

FileDataset

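A minimal sketch:

# Convert to a FileDataset of Parquet files (not materialized until read).
parquet_files = dataset.to_parquet_files()
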
to_spark_dataframe

Load all records from the dataset into a Spark DataFrame.

to_spark_dataframe()

Returns

Returns a Spark DataFrame.

Return type

pyspark.sql.DataFrame

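A minimal sketch; this assumes an environment where Spark is available (for example, Azure Synapse or Databricks):

# Load the dataset into a Spark DataFrame for distributed processing.
spark_df = dataset.to_spark_dataframe()
spark_df.printSchema()
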
with_timestamp_columns

Define timestamp columns for the dataset.

with_timestamp_columns(timestamp=None, partition_timestamp=None, validate=False, **kwargs)

Parameters

timestamp
str

The name of the column to use as the timestamp (formerly referred to as fine_grain_timestamp). Optional; the default is None, which clears the timestamp column.

partition_timestamp
str

The name of the column to use as the partition timestamp (formerly referred to as the coarse-grain timestamp). Optional; the default is None, which clears the partition timestamp column.

validate
bool

Indicates whether to validate that the specified columns exist in the dataset. The default is False. Validation requires that the data source is accessible from the current compute.

Returns

Returns a new TabularDataset with timestamp columns defined.

Return type

TabularDataset

Remarks

This method defines the columns to be used as timestamps. Timestamp columns on a dataset make it possible to treat the data as time-series data and enable additional capabilities. When a dataset has both timestamp (formerly referred to as fine_grain_timestamp) and partition_timestamp (formerly referred to as the coarse-grain timestamp) specified, the two columns should represent the same timeline.

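A minimal sketch; 'datetime' is a hypothetical column name:

# Define the timestamp column, validating that it exists in the dataset.
ts_dataset = dataset.with_timestamp_columns(timestamp='datetime', validate=True)

# The timestamp_columns attribute reflects the definition.
print(ts_dataset.timestamp_columns)  # ('datetime', None)
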
Attributes

timestamp_columns

Return the timestamp columns.

Returns

The column names for timestamp (formerly referred to as fine_grain_timestamp) and partition_timestamp (formerly referred to as the coarse-grain timestamp) defined for the dataset.

Return type

(str, str)