TabularDataset class

Definition

Represents a tabular dataset to use in the Azure Machine Learning service.

A TabularDataset defines a series of lazily-evaluated, immutable operations to load data from the data source into a tabular representation.

Data is not loaded from the source until TabularDataset is asked to deliver data.

TabularDataset()
Inheritance
builtins.object
azureml.data._dataset._Dataset
TabularDataset

Remarks

A TabularDataset is created using methods like from_delimited_files(path, validate=True, include_path=False, infer_column_types=True, set_column_types=None, separator=',', header=PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS, partition_format=None) from the TabularDatasetFactory class.

A TabularDataset can be used as the input of an experiment run. It can also be registered to a workspace with a specified name and be retrieved by that name later.

A TabularDataset can be subsetted by invoking the subsetting methods available on this class. The result of subsetting is always a new TabularDataset.

The actual data loading happens when TabularDataset is asked to deliver the data into another storage mechanism (for example, a pandas DataFrame or a CSV file).
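
For example, here is a minimal sketch of creating, registering, and retrieving a TabularDataset; the datastore path and dataset name are hypothetical:

    from azureml.core import Workspace, Dataset

    workspace = Workspace.from_config()  # assumes a config.json is available locally
    datastore = workspace.get_default_datastore()

    # Create a lazily-evaluated TabularDataset from CSV files; no data is read yet.
    dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'weather/*.csv'))

    # Register it to the workspace, then retrieve it by name later.
    dataset = dataset.register(workspace=workspace, name='weather-dataset')
    dataset = Dataset.get_by_name(workspace, name='weather-dataset')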

Methods

drop_columns(columns)

Drop the specified columns from the dataset.

keep_columns(columns, validate=False)

Keep the specified columns and drop all others from the dataset.

random_split(percentage, seed=None)

Split records in the dataset into two parts randomly and approximately by the percentage specified.

skip(count)

Skip records from the top of the dataset by the specified count.

take(count)

Take a sample of records from the top of the dataset by the specified count.

take_sample(probability, seed=None)

Take a random sample of records in the dataset approximately by the probability specified.

time_after(start_time, include_boundary=False)

Filter TabularDataset with timestamp columns after a specified start time.

time_before(end_time, include_boundary=False)

Filter TabularDataset with timestamp columns before a specified end time.

time_between(start_time, end_time, include_boundary=False)

Filter TabularDataset between a specified start and end time.

time_recent(time_delta, include_boundary=False)

Filter TabularDataset to contain only the most recent data specified by time_delta.

to_csv_files(separator=',')

Convert the current dataset into a FileDataset containing CSV files.

The resulting dataset will contain one or more CSV files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.

to_pandas_dataframe()

Load all records from the dataset into a pandas DataFrame.

to_parquet_files()

Convert the current dataset into a FileDataset containing Parquet files.

The resulting dataset will contain one or more Parquet files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.

to_spark_dataframe()

Load all records from the dataset into a Spark DataFrame.

with_timestamp_columns(fine_grain_timestamp, coarse_grain_timestamp=None, validate=False)

Define timestamp columns for the dataset.

drop_columns(columns)

Drop the specified columns from the dataset.

drop_columns(columns)

Parameters

columns
str or List[str]

The name or a list of names for the columns to drop.

Returns

Returns a new TabularDataset object for dataset with the specified columns dropped.

Return type

TabularDataset
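
For example, assuming dataset is the TabularDataset from the Remarks sketch above (the column names are hypothetical):

    # Returns a new TabularDataset without the listed columns; dataset itself is unchanged.
    slim = dataset.drop_columns(['humidity', 'wind_speed'])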

keep_columns(columns, validate=False)

Keep the specified columns and drop all others from the dataset.

keep_columns(columns, validate=False)

Parameters

columns
str or List[str]

The name or a list of names for the columns to keep.

validate
bool

Boolean to validate if data can be loaded from the returned dataset. Defaults to False. Validation requires that the data source is accessible from the current compute.

Returns

Returns a new TabularDataset object for dataset with only the specified columns kept.

Return type

TabularDataset
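
For example (hypothetical column names):

    # Keep only the listed columns; all others are dropped in the returned dataset.
    subset = dataset.keep_columns(['datetime', 'temperature'], validate=False)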

random_split(percentage, seed=None)

Split records in the dataset into two parts randomly and approximately by the percentage specified.

random_split(percentage, seed=None)

Parameters

percentage
float

The approximate percentage to split the dataset by. This must be a number between 0.0 and 1.0.

seed
int

Optional seed to use for the random generator.

Returns

Returns a tuple of new TabularDataset objects for the two split datasets.

Return type

(TabularDataset, TabularDataset)
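
For example, an approximate 80/20 train/test split:

    # About 80% of records go to train; the remainder go to test.
    train, test = dataset.random_split(percentage=0.8, seed=42)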

skip(count)

Skip records from the top of the dataset by the specified count.

skip(count)

Parameters

count
int

The number of records to skip.

Returns

Returns a new TabularDataset object for the dataset with records skipped.

Return type

TabularDataset
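
For example:

    # Skip the first 100 records; returns a new TabularDataset.
    rest = dataset.skip(100)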

take(count)

Take a sample of records from the top of the dataset by the specified count.

take(count)

Parameters

count
int

The number of records to take.

Returns

Returns a new TabularDataset object for the sampled dataset.

Return type

TabularDataset
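
For example, a quick preview of the data:

    # Take the first 1000 records, then load just that sample into pandas.
    preview = dataset.take(1000)
    print(preview.to_pandas_dataframe().head())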

take_sample(probability, seed=None)

Take a random sample of records in the dataset approximately by the probability specified.

take_sample(probability, seed=None)

Parameters

probability
float

The probability of a record being included in the sample.

seed
int

Optional seed to use for the random generator.

Returns

Returns a new TabularDataset object for the sampled dataset.

Return type

TabularDataset
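
For example:

    # Each record is kept with probability 0.1, so the sample size is approximate.
    sample = dataset.take_sample(probability=0.1, seed=42)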

time_after(start_time, include_boundary=False)

Filter TabularDataset with timestamp columns after a specified start time.

time_after(start_time, include_boundary=False)

Parameters

start_time
datetime

Lower bound for filtering data.

include_boundary
bool

Indicates whether the row associated with the boundary time (start_time) should be included.

Returns

A TabularDataset with the new filtered data

Return type

TabularDataset
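
For example, assuming a timestamp column has been defined with with_timestamp_columns (the column name is hypothetical):

    from datetime import datetime

    ts_dataset = dataset.with_timestamp_columns(fine_grain_timestamp='datetime')
    recent = ts_dataset.time_after(datetime(2020, 1, 1), include_boundary=True)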

time_before(end_time, include_boundary=False)

Filter TabularDataset with timestamp columns before a specified end time.

time_before(end_time, include_boundary=False)

Parameters

end_time
datetime

Upper bound for filtering data.

include_boundary
bool

Indicates whether the row associated with the boundary time (end_time) should be included.

Returns

A TabularDataset with the new filtered data

Return type

TabularDataset
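
For example, continuing the ts_dataset sketch above:

    from datetime import datetime

    older = ts_dataset.time_before(datetime(2020, 1, 1))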

time_between(start_time, end_time, include_boundary=False)

Filter TabularDataset between a specified start and end time.

time_between(start_time, end_time, include_boundary=False)

Parameters

start_time
datetime

Lower bound for filtering data.

end_time
datetime

Upper bound for filtering data.

include_boundary
bool

Indicates whether the rows associated with the boundary times (start_time and end_time) should be included.

Returns

A TabularDataset with the new filtered data

Return type

TabularDataset
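
For example:

    from datetime import datetime

    # Rows with timestamps in January 2020, boundaries included.
    january = ts_dataset.time_between(datetime(2020, 1, 1), datetime(2020, 1, 31), include_boundary=True)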

time_recent(time_delta, include_boundary=False)

Filter TabularDataset to contain only the most recent data specified by time_delta.

time_recent(time_delta, include_boundary=False)

Parameters

time_delta
timedelta

Amount of recent data to retrieve.

include_boundary
bool

Indicates whether the row associated with the boundary time (now - time_delta) should be included.

Returns

A TabularDataset with the new filtered data

Return type

TabularDataset
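
For example:

    from datetime import timedelta

    # Keep only records from the last seven days, relative to now.
    last_week = ts_dataset.time_recent(timedelta(days=7))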

to_csv_files(separator=',')

Convert the current dataset into a FileDataset containing CSV files.

The resulting dataset will contain one or more CSV files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.

to_csv_files(separator=',')

Parameters

separator
str

The separator to use to separate values in the resulting file.

Returns

Returns a new FileDataset object with a set of CSV files containing the data in this dataset.

Return type

FileDataset
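
For example (the target path is hypothetical):

    # Conversion is lazy; the files are materialized here by the download call.
    csv_files = dataset.to_csv_files(separator=',')
    csv_files.download(target_path='./csv_out')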

to_pandas_dataframe()

Load all records from the dataset into a pandas DataFrame.

to_pandas_dataframe()

Returns

Returns a pandas DataFrame.

Return type

pandas.DataFrame
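
For example:

    # Loads all records into memory; be mindful of the dataset size.
    df = dataset.to_pandas_dataframe()
    print(df.shape)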

to_parquet_files()

Convert the current dataset into a FileDataset containing Parquet files.

The resulting dataset will contain one or more Parquet files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.

to_parquet_files()

Returns

Returns a new FileDataset object with a set of Parquet files containing the data in this dataset.

Return type

FileDataset
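
For example (the target path is hypothetical):

    parquet_files = dataset.to_parquet_files()
    parquet_files.download(target_path='./parquet_out')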

to_spark_dataframe()

Load all records from the dataset into a Spark DataFrame.

to_spark_dataframe()

Returns

Returns a Spark DataFrame.

Return type

pyspark.sql.DataFrame
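
For example, in an environment where Spark is available:

    spark_df = dataset.to_spark_dataframe()
    spark_df.show(10)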

with_timestamp_columns(fine_grain_timestamp, coarse_grain_timestamp=None, validate=False)

Define timestamp columns for the dataset.

with_timestamp_columns(fine_grain_timestamp, coarse_grain_timestamp=None, validate=False)

Parameters

fine_grain_timestamp
str

The name of the column to use as the fine grain timestamp. Use None to clear it.

coarse_grain_timestamp
str

The name of the column to use as the coarse grain timestamp (optional). The default is None (clear).

validate
bool

Boolean to validate if specified columns exist in dataset. Defaults to False. Validation requires that the data source is accessible from the current compute.

Returns

Returns a new TabularDataset with timestamp columns defined.

Return type

TabularDataset

Remarks

Define the columns to be used as timestamps. Timestamp columns on a dataset make it possible to treat the data as time-series data and enables additional capabilities. When a dataset has both fine_grain_timestamp and coarse_grain_timestamp defined, the two columns should represent the same timeline.
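
For example, with hypothetical column names where 'datetime' holds exact timestamps and 'date' holds a coarser view of the same timeline:

    ts_dataset = dataset.with_timestamp_columns(
        fine_grain_timestamp='datetime',
        coarse_grain_timestamp='date')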

Attributes

timestamp_columns

Return the timestamp columns.

Returns

The column names for fine_grain_timestamp and coarse_grain_timestamp defined for the dataset.

Return type

(str, str)
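
For example, continuing the ts_dataset sketch above:

    fine_grain, coarse_grain = ts_dataset.timestamp_columns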