Convert to Dataset

Converts data input to the internal Dataset format used by Microsoft Azure Machine Learning.

Category: Data Format Conversions

Module overview

This article describes how to use the Convert to Dataset module in Azure Machine Learning Studio to convert any data that you might need for an experiment to the internal format used by Studio.

Conversion is not required in most cases, because Azure Machine Learning implicitly converts data to its native dataset format when any operation is performed on the data.

However, saving data to the dataset format is recommended if you have performed some kind of normalization or cleaning on a set of data, and you want to ensure that the changes are used in further experiments.

Note

Convert to Dataset changes only the format of the data; it does not save a new copy of the data in the workspace. To save the dataset, right-click the output port, select Save as Dataset, and type a new name.

How to use Convert to Dataset

We recommend that you use the Edit Metadata module to prepare the dataset before using Convert to Dataset. You can add or change column names, adjust data types, and so forth.

  1. Add the Convert to Dataset module to your experiment. You can find this module in the Data Format Conversions category in Azure Machine Learning Studio.

  2. Connect it to any module that outputs a dataset.

    As long as the data is tabular, you can convert it to a dataset. This includes data loaded using Import Data, data created by using Enter Data Manually, data generated by code in custom modules, datasets transformed by using Apply Transformation, or datasets that were generated or modified by using Apply SQL Transformation.

  3. In the Action dropdown list, indicate whether you want to do any cleanup on the data before saving the dataset:

    • None: Use the data as is.

    • SetMissingValue: Specify a placeholder that is inserted in the dataset wherever there is a missing value. The default placeholder is the question mark character (?), but you can use the Custom missing value option to type a different value.

    • ReplaceValues: Use this option to specify a single exact value to be replaced with any other exact value. For example, assuming your data contains the string obs used as a placeholder for missing values, you could specify a custom replacement operation using these options:

      1. Set Replace to Custom.

      2. For Custom value, type the value you want to find. In this case, you would type obs.

      3. For New value, type the new value to replace the original string with. In this case, you might type ?

    Note that the ReplaceValues operation applies only to exact matches. For example, these strings would not be affected: obs., obsolete.

    • SparseOutput: Indicates that the dataset is sparse. By creating a sparse data vector, you can ensure that missing values do not affect a sparse data distribution. After choosing this option, you must indicate how missing values and zero values should be handled.

    To remove any value other than zero, click the Remove option and type a single value to remove. You can remove missing values, or set a custom value to delete from the vector. Only exact matches are removed. For example, if you type x in the Remove value text box, the value xx would not be affected.

    By default, the option Remove zeroes is set to True, meaning that all zero values are removed when the sparse column is created.

  4. Run the experiment, or right-click the Convert to Dataset module and select Run selected.
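
The SetMissingValue and ReplaceValues actions in step 3 can be pictured with a small Python sketch. This only illustrates the documented behavior and is not Studio's implementation: the function names set_missing_value and replace_values, and the use of None to stand for a missing cell, are assumptions made for the example.

```python
from typing import List, Optional

# None stands for a missing cell (an assumption of this sketch).
Column = List[Optional[str]]


def set_missing_value(column: Column, placeholder: str = "?") -> List[str]:
    """Mimic SetMissingValue: insert a placeholder wherever a value is missing."""
    return [placeholder if cell is None else cell for cell in column]


def replace_values(column: Column, find: str, new: str) -> Column:
    """Mimic ReplaceValues: replace cells equal to `find`; partial matches are left alone."""
    return [new if cell == find else cell for cell in column]


column = ["12", None, "obs", "obs.", "obsolete"]

print(set_missing_value(column))
# ['12', '?', 'obs', 'obs.', 'obsolete']

print(replace_values(column, find="obs", new="?"))
# ['12', None, '?', 'obs.', 'obsolete']
```

In the second call, only the cell that is exactly obs changes; obs. and obsolete are untouched, which matches the exact-match rule described in step 3.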

Results

  • To save the resulting dataset with a new name, right-click the output of Convert to Dataset and select Save as Dataset.

Examples

You can see examples of how the Convert to Dataset module is used in the Azure AI Gallery:

  • CRM sample: Reads from a shared dataset and saves a copy of the dataset in the local workspace.

  • Flight Delay example: Saves a dataset that has been cleaned by replacing missing values so that you can use it for future experiments.

Technical notes

This section contains implementation details, tips, and answers to frequently asked questions.

  • Any module that takes a dataset as input can also take data in the CSV, TSV, or ARFF formats. Before any module code runs, the inputs are preprocessed in a way that is equivalent to running the Convert to Dataset module on them.

  • You cannot convert from the SVMLight format to dataset.

  • When specifying a custom replace operation, the search and replace operation applies to complete values; partial matches are not allowed. For example, you can replace a 3 with a -1 or with 33, but you cannot replace a 3 in a two-digit number such as 35.

  • For custom replace operations, the replacement fails silently if the replacement value does not conform to the current data type of the column.

  • Internally, Studio supports sparse arrays by using SparseVector, a class in the Math.NET numeric library. If you need to save numerical data that is sparse and has missing values, prepare your data so that it uses zeros and missing values, and then use Convert to Dataset with the arguments SparseOutput and Remove Zeros = TRUE.
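
The effect of the SparseOutput option can be pictured as storing only the cells that survive the removal rules. The sketch below is a plain Python illustration, not the Math.NET SparseVector class and not Studio's code; the function name to_sparse, its parameters, and the use of None for a missing value are all assumptions.

```python
from typing import Dict, List, Optional


def to_sparse(
    row: List[Optional[float]],
    remove_zeros: bool = True,
    remove_missing: bool = True,
) -> Dict[int, Optional[float]]:
    """Keep only the entries that are not removed, keyed by column index."""
    sparse: Dict[int, Optional[float]] = {}
    for index, value in enumerate(row):
        if value is None:
            # Missing value: drop it only when asked to.
            if remove_missing:
                continue
        elif value == 0 and remove_zeros:
            # Zero value: dropped by default, as with Remove Zeros = TRUE.
            continue
        sparse[index] = value
    return sparse


row = [0.0, 3.5, None, 0.0, 7.0]

print(to_sparse(row))
# {1: 3.5, 4: 7.0}

print(to_sparse(row, remove_zeros=False))
# {0: 0.0, 1: 3.5, 3: 0.0, 4: 7.0}
```

With the defaults, both zeros and missing values are left out, so only two of the five cells are stored.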

Expected inputs

Name     Type        Description
-------  ----------  -------------
Dataset  Data Table  Input dataset

Module parameters

Name    Range  Type           Default  Description
------  -----  -------------  -------  --------------------------------
Action  List   Action Method  None     Action to apply to input dataset

Output

Name             Type        Description
---------------  ----------  --------------
Results dataset  Data Table  Output dataset

See also

Data Format Conversions
A-Z Module List