Convert to Dataset
Converts input data to the internal Dataset format used by Microsoft Azure Machine Learning.
Category: Data Format Conversions
This article describes how to use the Convert to Dataset module in Azure Machine Learning Studio to convert any data that you need for an experiment to the internal format used by Studio.
Conversion is not required in most cases, because Azure Machine Learning implicitly converts data to its native dataset format when any operation is performed on the data.
However, saving data in the dataset format is recommended if you have normalized or cleaned a set of data and you want to ensure that the changes are used in further experiments.
Convert to Dataset changes only the format of the data, and it does not save a new copy of the data in the workspace. To save the dataset, double-click the output port, select Save as dataset, and type a new name.
How to use Convert to Dataset
Add the Convert to Dataset module to your experiment, and connect it to any module that outputs a dataset.
As long as the data is tabular, you can convert it to a dataset. This includes data loaded using Import Data, data created by using Enter Data Manually, data generated by code in custom modules, datasets transformed by using Apply Transformation, or datasets that were generated or modified by using Apply SQL Transformation.
In the Action dropdown list, indicate if you want to do any cleanup on the data before saving the dataset:
None: Use the data as is.
SetMissingValue: Specify a placeholder that is inserted in the dataset wherever there is a missing value. The default placeholder is the question mark character (?), but you can use the Custom missing value option to type a different value.
ReplaceValues: Use this option to specify a single exact value to be replaced with any other exact value. For example, assuming your data contains the string obs used as a placeholder for missing values, you could specify a custom replacement operation using these options:
Set Replace to Custom.
For Custom value, type the value you want to find. In this case, you would type obs.
For New value, type the new value to replace the original string with. In this case, you might type the question mark character (?).
Note that the ReplaceValues operation applies only to exact matches. For example, a longer string that merely contains obs, such as obsolete, would not be affected.
SparseOutput: Indicates that the dataset is sparse. By creating a sparse data vector, you can ensure that missing values do not affect a sparse data distribution. After choosing this option, you must indicate how missing values and zero values should be handled.
To remove any value other than zero, click the Remove option and type a single value to remove. You can remove missing values, or set a custom value to delete from the vector. Only exact matches will be removed. For example, if you type x in the Remove value text box, the row xx would not be affected.
By default, the option Remove zeroes is set to True, meaning that all zero values are removed when the sparse column is created.
Run the experiment, or right-click the Convert to Dataset module and select Run selected.
To save the resulting dataset with a new name, right-click the output of Convert to Dataset and select Save as Dataset.
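The SetMissingValue and ReplaceValues actions described above can be pictured with a small sketch in plain Python. This is an illustration only, not Studio's implementation: the function names `set_missing_value` and `replace_values` are hypothetical, a missing cell is assumed to be represented by `None`, and the sample rows are invented for the example.

```python
from typing import List, Optional

Row = List[Optional[str]]  # assumption: None stands for a missing cell


def set_missing_value(rows: List[Row], placeholder: str = "?") -> List[Row]:
    """Return a copy of rows with every missing cell replaced by the placeholder."""
    return [[placeholder if cell is None else cell for cell in row] for row in rows]


def replace_values(rows: List[Row], find: str, new: str) -> List[Row]:
    """Return a copy of rows where cells exactly equal to `find` become `new`."""
    return [[new if cell == find else cell for cell in row] for row in rows]


# Invented sample data: "obs" is used as a placeholder for missing values.
rows = [["12", "obs"], [None, "obsolete"], ["obs", "7"]]

filled = set_missing_value(rows)                      # None becomes "?"
replaced = replace_values(rows, find="obs", new="?")  # only exact "obs" cells change

print(filled)
print(replaced)
```

In this sketch the cell containing obsolete is left alone, because the comparison is on the whole cell value rather than on a substring.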
Examples
CRM sample: Reads from a shared dataset and saves a copy of the dataset in the local workspace.
Flight Delay example: Saves a dataset that has been cleaned by replacing missing values so that you can use it for future experiments.
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
Any module that takes a dataset as input can also take data in the CSV, TSV, or ARFF formats. Before any module code is executed, preprocessing of the inputs is performed, which is equivalent to running the Convert to Dataset module on the input.
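As a rough analogy for this implicit preprocessing, the following sketch parses CSV and TSV text into the same tabular structure using Python's standard `csv` module. It is not Studio's preprocessing code; the helper name `read_delimited` and the sample text are assumptions made for illustration.

```python
import csv
import io
from typing import Dict, List


def read_delimited(text: str, delimiter: str = ",") -> List[Dict[str, str]]:
    """Parse delimited text with a header row into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Invented sample data: the same two-row table in CSV and in TSV form.
csv_text = "id,delay\n1,12\n2,0\n"
tsv_text = "id\tdelay\n1\t12\n2\t0\n"

from_csv = read_delimited(csv_text)
from_tsv = read_delimited(tsv_text, delimiter="\t")

print(from_csv)
print(from_csv == from_tsv)
```

Once both inputs are in the same in-memory form, downstream code no longer needs to know which file format the data came from, which is the effect the implicit conversion has for modules.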
You cannot convert from the SVMLight format to dataset.
When specifying a custom replace operation, the search and replace operation applies to complete values; partial matches are not allowed. For example, you can replace a 3 with a -1 or with 33, but you cannot replace a 3 in a two-digit number such as 35.
For custom replace operations, the replacement fails silently if the replacement value does not conform to the current data type of the column.
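The two rules above (whole-value matching, and silent failure when the replacement does not fit the column type) can be sketched for a numeric column as follows. This is a simplified model under stated assumptions, not Studio's code: the function name `replace_in_numeric_column` is hypothetical, and the silent failure is modeled by returning an unchanged copy of the column.

```python
from typing import List, Union

Number = Union[int, float]


def replace_in_numeric_column(column: List[Number], find: str, new: str) -> List[Number]:
    """Replace cells equal to `find` with `new` in a numeric column.

    If either value cannot be read as a number, the column is returned
    unchanged and no error is raised (assumed model of the silent failure).
    """
    try:
        find_value = float(find)
        new_value = float(new)
    except ValueError:
        return list(column)
    # Whole-value comparison: 35 is not affected when searching for 3.
    return [new_value if cell == find_value else cell for cell in column]


column = [3, 35, 13, 3]  # invented sample data

ok = replace_in_numeric_column(column, find="3", new="-1")
skipped = replace_in_numeric_column(column, find="3", new="n/a")

print(ok)
print(skipped)
```

Because nothing signals the skipped replacement, it is worth comparing the output with the input after a custom replace on a typed column.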
Internally, Studio supports sparse arrays by using SparseVector, a class in the Math.NET Numerics library. If you need to save numerical data that is sparse and has missing values, prepare your data so that it uses zeros and has missing values, and then use Convert to Dataset with Action set to SparseOutput and Remove zeroes set to True.
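To make the effect of SparseOutput more concrete, here is a minimal sketch that stores only the retained entries of a column as an index-to-value mapping. It does not use Math.NET or reproduce SparseVector; the function name `to_sparse`, the use of `None` for missing values, the choice to always drop missing values, and the sample data are all assumptions for illustration.

```python
from typing import Dict, List, Optional


def to_sparse(
    values: List[Optional[float]],
    remove_zeros: bool = True,
    remove_value: Optional[float] = None,
) -> Dict[int, float]:
    """Keep only the entries that survive the removal rules, keyed by position."""
    sparse: Dict[int, float] = {}
    for index, value in enumerate(values):
        if value is None:
            continue  # assumption: missing values are never stored
        if remove_zeros and value == 0:
            continue  # mirrors Remove zeroes = True
        if remove_value is not None and value == remove_value:
            continue  # exact match on the custom value to remove
        sparse[index] = value
    return sparse


dense = [0.0, 1.5, None, 0.0, -999.0, 2.0]  # invented sample column

default_sparse = to_sparse(dense)
without_sentinel = to_sparse(dense, remove_value=-999.0)
keep_zeros = to_sparse(dense, remove_zeros=False)

print(default_sparse)
print(without_sentinel)
print(keep_zeros)
```

Keeping the original positions as keys means the dense column can be rebuilt later by treating every absent index as a removed entry.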
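Expected inputs

|Name|Type|Description|
|---|---|---|
|Dataset|Data Table|Input dataset|

Module parameters

|Name|Range|Type|Default|Description|
|---|---|---|---|---|
|Action|List|Action Method|None|Action to apply to input dataset|

Outputs

|Name|Type|Description|
|---|---|---|
|Results dataset|Data Table|Output dataset|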