TabularDataset Class
Represents a tabular dataset to use in Azure Machine Learning.
A TabularDataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation. Data is not loaded from the source until TabularDataset is asked to deliver data.
TabularDataset is created using methods like from_delimited_files from the TabularDatasetFactory class.
For more information, see the article Add & register datasets. To get started working with a tabular dataset, see https://aka.ms/tabulardataset-samplenotebook.
Initialize a TabularDataset object.
This constructor is not intended to be invoked directly. A TabularDataset is intended to be created using the TabularDatasetFactory class.
- Inheritance
-
TabularDataset
Constructor
TabularDataset()
Remarks
A TabularDataset can be created from CSV, TSV, Parquet files, or SQL query results using the from_* methods of the TabularDatasetFactory class. You can perform subsetting operations on a TabularDataset, such as splitting, skipping, and filtering records. The result of subsetting is always one or more new TabularDataset objects.
You can also convert a TabularDataset into other formats, such as a pandas DataFrame. The actual data loading happens when the TabularDataset is asked to deliver the data into another storage mechanism (for example, a pandas DataFrame or a CSV file).
A TabularDataset can be used as the input of an experiment run. It can also be registered to a workspace with a specified name and be retrieved by that name later.
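A rough sketch of that flow is shown below; the datastore path and dataset name are hypothetical:

```python
from azureml.core import Dataset, Workspace

# Connect to the workspace (assumes a config.json written by the Azure ML SDK/CLI).
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Lazily define a TabularDataset from delimited files; no data is read yet.
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'weather/2018/*.csv'))

# Subsetting operations always return new TabularDataset objects.
preview = dataset.take(5)

# Data is loaded only when delivery is requested, e.g. into a pandas DataFrame.
df = preview.to_pandas_dataframe()

# Optionally register the dataset so it can be retrieved by name later.
dataset = dataset.register(workspace=ws, name='weather-2018')
```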
Methods
| Method | Description |
| --- | --- |
| download | (Experimental; see https://aka.ms/azuremlexperimental.) Download file streams defined by the dataset to a local path. |
| drop_columns | Drop the specified columns from the dataset. If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well. |
| filter | (Experimental; see https://aka.ms/azuremlexperimental.) Filter the data, leaving only the records that match the specified expression. |
| get_profile | (Experimental; see https://aka.ms/azuremlexperimental.) Get the data profile from the latest profile run submitted for this or the same dataset in the workspace. |
| get_profile_runs | (Experimental; see https://aka.ms/azuremlexperimental.) Return previous profile runs associated with this or the same dataset in the workspace. |
| keep_columns | Keep the specified columns and drop all others from the dataset. If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well. |
| mount | (Experimental; see https://aka.ms/azuremlexperimental.) Create a context manager for mounting file streams defined by the dataset as local files. |
| partition_by | Copy the partitioned data to the destination specified by target, create a dataset from the output data path with the partition format, register the dataset if a name is provided, and return the dataset for the new data path with partitions. |
| random_split | Split records in the dataset into two parts randomly and approximately by the percentage specified. The first dataset contains approximately the specified percentage of the total records and the second dataset contains the remaining records. |
| skip | Skip records from the top of the dataset by the specified count. |
| submit_profile_run | (Experimental; see https://aka.ms/azuremlexperimental.) Submit an experimentation run to calculate the data profile. A data profile can be very useful for understanding the input data and identifying anomalies and missing values, by providing information about the data such as column types and missing value counts. |
| take | Take a sample of records from the top of the dataset by the specified count. |
| take_sample | Take a random sample of records in the dataset, approximately by the probability specified. |
| time_after | Filter the TabularDataset with timestamp columns after a specified start time. |
| time_before | Filter the TabularDataset with timestamp columns before a specified end time. |
| time_between | Filter the TabularDataset between a specified start and end time. |
| time_recent | Filter the TabularDataset to contain only the specified duration (amount) of recent data. |
| to_csv_files | Convert the current dataset into a FileDataset containing CSV files. The resulting dataset will contain one or more CSV files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from. |
| to_dask_dataframe | (Experimental; see https://aka.ms/azuremlexperimental.) Return a Dask DataFrame that can lazily read the data in the dataset. |
| to_pandas_dataframe | Load all records from the dataset into a pandas DataFrame. |
| to_parquet_files | Convert the current dataset into a FileDataset containing Parquet files. The resulting dataset will contain one or more Parquet files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from. |
| to_spark_dataframe | Load all records from the dataset into a Spark DataFrame. |
| with_timestamp_columns | Define timestamp columns for the dataset. |
download
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Download file streams defined by the dataset to a local path.
download(stream_column, target_path=None, overwrite=False, ignore_not_found=True)
Parameters
- target_path
- str
The local directory to download the files to. If None, the data will be downloaded into a temporary directory.
- overwrite
- bool
Indicates whether to overwrite existing files. The default is False. Existing files will be overwritten if overwrite is set to True; otherwise an exception will be raised.
- ignore_not_found
- bool
Indicates whether to fail the download if some files pointed to by the dataset are not found. The default is True. If ignore_not_found is set to False, the download will fail if any file download fails for any reason; otherwise a warning will be logged for not-found errors and the download will succeed as long as no other error types are encountered.
Returns
Returns an array of file paths for each file downloaded.
Return type
drop_columns
Drop the specified columns from the dataset.
If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well.
drop_columns(columns)
Parameters
Returns
Returns a new TabularDataset object with the specified columns dropped.
Return type
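For example, assuming dataset is an existing TabularDataset and the column names are hypothetical:

```python
# Returns a new TabularDataset; the original dataset is unchanged.
slim_ds = dataset.drop_columns(['raw_payload', 'debug_flags'])
```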
filter
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Filter the data, leaving only the records that match the specified expression.
filter(expression)
Parameters
Returns
The modified dataset (unregistered).
Return type
Remarks
Expressions are started by indexing the Dataset with the name of a column. They support a variety of functions and operators and can be combined using logical operators. The resulting expression will be lazily evaluated for each record when a data pull occurs and not where it is defined.
dataset['myColumn'] > dataset['columnToCompareAgainst']
dataset['myColumn'].starts_with('prefix')
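A small sketch of how such an expression might be used (the column names are hypothetical, and combining clauses with & is assumed here):

```python
# Keep only rows matching the combined expression; evaluation is deferred.
filtered = dataset.filter(
    (dataset['temperature'] > 0) & (dataset['station'].starts_with('KSEA'))
)
df = filtered.to_pandas_dataframe()  # data is pulled (and filtered) here
```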
get_profile
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Get the data profile from the latest profile run submitted for this or the same dataset in the workspace.
get_profile(workspace=None)
Parameters
- workspace
- Workspace
The workspace where the profile run was submitted. Defaults to the workspace of this dataset. Required if the dataset is not associated with a workspace. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace for more information on workspaces.
Returns
Profile result from the latest profile run of type DatasetProfile.
Return type
get_profile_runs
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Return previous profile runs associated with this or the same dataset in the workspace.
get_profile_runs(workspace=None)
Parameters
- workspace
- Workspace
The workspace where the profile run was submitted. Defaults to the workspace of this dataset. Required if the dataset is not associated with a workspace. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace for more information on workspaces.
Returns
iterator object of type azureml.core.Run.
Return type
keep_columns
Keep the specified columns and drop all others from the dataset.
If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset as well.
keep_columns(columns, validate=False)
Parameters
- validate
- bool
Indicates whether to validate if data can be loaded from the returned dataset. The default is False. Validation requires that the data source is accessible from current compute.
Returns
Returns a new TabularDataset object with only the specified columns kept.
Return type
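For example, with hypothetical column names:

```python
# Keep only the columns needed downstream; validate=True checks that data can load.
features = dataset.keep_columns(['timestamp', 'temperature', 'humidity'], validate=True)
```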
mount
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Create a context manager for mounting file streams defined by the dataset as local files.
mount(stream_column, mount_point=None)
Parameters
- mount_point
- str
The local directory to mount the files to. If None, the data will be mounted into a temporary directory, which you can find by calling the MountContext.mount_point instance method.
Returns
Returns a context manager for managing the lifecycle of the mount.
Return type
partition_by
Partitioned data will be copied and output to the destination specified by target.
The method creates a dataset from the output data path with the partition format, registers the dataset if a name is provided, and returns the dataset for the new data path with partitions.
ds = Dataset.get_by_name('test') # indexed by country, state, partition_date
# #1: call partition_by locally
new_ds = ds.partition_by(name="repartitioned_ds", partition_keys=['country'],
target=DataPath(datastore, "repartition"))
partition_keys = new_ds.partition_keys # ['country']
# new_ds can be passed to PRS as input dataset
partition_by(partition_keys, target, name=None, show_progress=True, partition_as_file_dataset=False)
Parameters
- target
Required; the datastore path where the dataframe's Parquet data will be uploaded. A GUID folder will be generated under the target path to avoid conflict.
- show_progress
- bool
Optional. Indicates whether to show the progress of the upload in the console. Defaults to True.
- partition_as_file_dataset
Optional. Indicates whether to return a FileDataset or not. Defaults to False.
Returns
The saved or registered dataset.
Return type
random_split
Split records in the dataset into two parts randomly and approximately by the percentage specified.
The first dataset contains approximately percentage of the total records and the second dataset contains the remaining records.
random_split(percentage, seed=None)
Parameters
- percentage
- float
The approximate percentage to split the dataset by. This must be a number between 0.0 and 1.0.
Returns
Returns a tuple of new TabularDataset objects representing the two datasets after the split.
Return type
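A typical train/test split might look like this sketch:

```python
# Roughly 80% of records in train_ds, the rest in test_ds; seed makes it repeatable.
train_ds, test_ds = dataset.random_split(percentage=0.8, seed=42)
```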
skip
Skip records from the top of the dataset by the specified count.
skip(count)
Parameters
Returns
Returns a new TabularDataset object representing a dataset with records skipped.
Return type
submit_profile_run
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Submit an experimentation run to calculate the data profile.
A data profile can be very useful for understanding the input data and identifying anomalies and missing values, by providing information about the data such as column types and missing value counts.
submit_profile_run(compute_target, experiment, cache_datastore_name=None)
Parameters
- compute_target
- Union[str, ComputeTarget]
The compute target to run the profile calculation experiment on. Specify 'local' to use local compute. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.computetarget for more information on compute targets.
- experiment
- Experiment
The experiment object. See https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.experiment for more information on experiments.
- cache_datastore_name
- str
The name of the datastore to store the profile cache. If None, the default datastore will be used.
Returns
An object of the DatasetProfileRun class.
Return type
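A minimal sketch, assuming an existing workspace and a hypothetical experiment name:

```python
from azureml.core import Experiment, Workspace

ws = Workspace.from_config()
experiment = Experiment(ws, 'dataset-profiling')  # hypothetical experiment name

# Run the profile calculation on local compute (a remote ComputeTarget also works).
profile_run = dataset.submit_profile_run(compute_target='local', experiment=experiment)

# After the run completes, the latest profile can be retrieved from the workspace.
profile = dataset.get_profile(workspace=ws)
```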
take
Take a sample of records from the top of the dataset by the specified count.
take(count)
Parameters
Returns
Returns a new TabularDataset object representing the sampled dataset.
Return type
take_sample
Take a random sample of records in the dataset approximately by the probability specified.
take_sample(probability, seed=None)
Parameters
Returns
Returns a new TabularDataset object representing the sampled dataset.
Return type
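For example, combining take, skip, and take_sample on an existing dataset:

```python
head = dataset.take(1000)                               # first 1,000 records
tail = dataset.skip(1000)                               # everything after the first 1,000
sample = dataset.take_sample(probability=0.1, seed=7)   # roughly 10% of all records
```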
time_after
Filter TabularDataset with time stamp columns after a specified start time.
time_after(start_time, include_boundary=True, validate=True)
Parameters
- include_boundary
- bool
Indicates whether the row associated with the boundary time (start_time) should be included.
- validate
- bool
Indicates whether to validate if specified columns exist in dataset. The default is True. Validation requires that the data source is accessible from the current compute.
Returns
A TabularDataset with the new filtered dataset.
Return type
time_before
Filter TabularDataset with time stamp columns before a specified end time.
time_before(end_time, include_boundary=True, validate=True)
Parameters
- include_boundary
- bool
Indicates whether the row associated with the boundary time (end_time) should be included.
- validate
- bool
Indicates whether to validate if specified columns exist in dataset. The default is True. Validation requires that the data source is accessible from the current compute.
Returns
A TabularDataset with the new filtered dataset.
Return type
time_between
Filter TabularDataset between a specified start and end time.
time_between(start_time, end_time, include_boundary=True, validate=True)
Parameters
- include_boundary
- bool
Indicates whether the rows associated with the boundary times (start_time and end_time) should be included.
- validate
- bool
Indicates whether to validate if specified columns exist in dataset. The default is True. Validation requires that the data source is accessible from the current compute.
Returns
A TabularDataset with the new filtered dataset.
Return type
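For example, assuming the dataset already has a timestamp column defined via with_timestamp_columns:

```python
from datetime import datetime

# Keep only records stamped within January 2020, including the boundary rows.
jan_ds = dataset.time_between(
    start_time=datetime(2020, 1, 1),
    end_time=datetime(2020, 1, 31),
    include_boundary=True,
)
```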
time_recent
Filter TabularDataset to contain only the specified duration (amount) of recent data.
time_recent(time_delta, include_boundary=True, validate=True)
Parameters
- include_boundary
- bool
Indicates whether the row associated with the boundary time (time_delta) should be included.
- validate
- bool
Indicates whether to validate if specified columns exist in dataset. The default is True. Validation requires that the data source is accessible from the current compute.
Returns
A TabularDataset with the new filtered dataset.
Return type
to_csv_files
Convert the current dataset into a FileDataset containing CSV files.
The resulting dataset will contain one or more CSV files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.
to_csv_files(separator=',')
Parameters
Returns
Returns a new FileDataset object with a set of CSV files containing the data in this dataset.
Return type
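A short sketch of converting and then materializing the files (the target path is hypothetical):

```python
# Convert to a FileDataset of CSV files; nothing is written yet.
csv_file_ds = dataset.to_csv_files(separator=',')

# Files are materialized only when downloaded (or mounted / read from).
local_paths = csv_file_ds.download(target_path='./csv_out', overwrite=True)
```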
to_dask_dataframe
Note
This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Return a Dask DataFrame that can lazily read the data in the dataset.
to_dask_dataframe(sample_size=10000, dtypes=None, on_error='null', out_of_range_datetime='null')
Parameters
- sample_size
The number of records to read to determine schema and types.
- dtypes
An optional dict specifying the expected columns and their dtypes. sample_size is ignored if this is provided.
- on_error
How to handle any error values in the dataset, such as those produced by an error while parsing values. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.
- out_of_range_datetime
How to handle date-time values that are outside the range supported by Pandas. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.
Returns
dask.dataframe.core.DataFrame
to_pandas_dataframe
Load all records from the dataset into a pandas DataFrame.
to_pandas_dataframe(on_error='null', out_of_range_datetime='null')
Parameters
- on_error
How to handle any error values in the dataset, such as those produced by an error while parsing values. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.
- out_of_range_datetime
How to handle date-time values that are outside the range supported by Pandas. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.
Returns
Returns a pandas DataFrame.
Return type
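For example:

```python
# Replace unparsable values and out-of-range datetimes with null instead of failing.
df = dataset.to_pandas_dataframe(on_error='null', out_of_range_datetime='null')
print(df.dtypes)
```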
to_parquet_files
Convert the current dataset into a FileDataset containing Parquet files.
The resulting dataset will contain one or more Parquet files, each corresponding to a partition of data from the current dataset. These files are not materialized until they are downloaded or read from.
to_parquet_files()
Returns
Returns a new FileDataset object with a set of Parquet files containing the data in this dataset.
Return type
to_spark_dataframe
Load all records from the dataset into a Spark DataFrame.
to_spark_dataframe()
Returns
Returns a Spark DataFrame.
Return type
with_timestamp_columns
Define timestamp columns for the dataset.
with_timestamp_columns(timestamp=None, partition_timestamp=None, validate=False, **kwargs)
Parameters
- timestamp
- str
The name of the column to use as the timestamp (formerly referred to as fine_grain_timestamp) (optional). The default is None (clear).
- partition_timestamp
- str
The name of the column to use as partition_timestamp (formerly referred to as the coarse grain timestamp) (optional). The default is None (clear).
- validate
- bool
Indicates whether to validate if specified columns exist in dataset. The default is False. Validation requires that the data source is accessible from the current compute.
Returns
Returns a new TabularDataset with timestamp columns defined.
Return type
Remarks
This method defines the columns to be used as timestamps. Timestamp columns on a dataset make it possible to treat the data as time-series data and enable additional capabilities. When a dataset has both timestamp (formerly referred to as fine_grain_timestamp) and partition_timestamp (formerly referred to as coarse grain timestamp) specified, the two columns should represent the same timeline.
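For example, assuming the dataset has a column named 'datetime' (a hypothetical name):

```python
from datetime import timedelta

# Assign the timestamp column; this enables the time_* filters shown above.
ts_dataset = dataset.with_timestamp_columns(timestamp='datetime', validate=True)

# Keep only the last 7 days of data.
recent = ts_dataset.time_recent(timedelta(days=7))
```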
Attributes
timestamp_columns