Data Format Conversions

Note

Applies to: Machine Learning Studio

This content pertains only to Studio. Similar drag-and-drop modules have been added to the visual interface in Machine Learning service. Learn more in this article comparing the two versions.

This article lists the modules provided in Azure Machine Learning Studio for converting data among various file formats used in machine learning.

The supported formats include:

  • The dataset format that's used throughout Azure Machine Learning.
  • The ARFF format that's used by Weka. Weka is an open-source Java-based set of machine learning algorithms.
  • The SVMLight format, which was developed for the SVMlight machine learning framework. With some modification of the files, it can also be used by Vowpal Wabbit.
  • The tab-separated (TSV) and comma-separated (CSV) flat file formats that are supported by most relational databases. These formats are also widely supported by R and Python.

Converting data to these formats makes it easier to move results and data between different machine learning frameworks or storage mechanisms.
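
To make the differences between these formats concrete, the following sketch serializes one tiny table as CSV, TSV, ARFF, and SVMLight text by using only the Python standard library. It is an illustration of the file layouts, not the code that the Studio modules run. The toy columns and the function names (`to_delimited`, `to_arff`, `to_svmlight`) are hypothetical, and the ARFF writer assumes that every column is numeric.

```python
import csv
import io

# Hypothetical toy dataset: two numeric features and a binary label.
COLUMNS = ["height", "width", "label"]
ROWS = [
    [1.5, 0.0, 1],
    [0.0, 2.0, -1],
]


def to_delimited(columns, rows, delimiter):
    """Serialize rows with a header line; use ',' for CSV or a tab for TSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def to_arff(relation, columns, rows):
    """Serialize all-numeric rows as a minimal ARFF document."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in columns]
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


def to_svmlight(rows, label_index):
    """Serialize rows as '<label> <index>:<value> ...' lines.

    Feature indices are 1-based and zero-valued features are omitted,
    which is the usual sparse convention for this format.
    """
    lines = []
    for row in rows:
        features = [v for i, v in enumerate(row) if i != label_index]
        pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0]
        lines.append(" ".join([str(row[label_index])] + pairs))
    return "\n".join(lines) + "\n"


csv_text = to_delimited(COLUMNS, ROWS, ",")
tsv_text = to_delimited(COLUMNS, ROWS, "\t")
arff_text = to_arff("toy", COLUMNS, ROWS)
svmlight_text = to_svmlight(ROWS, label_index=2)
```

Note that the CSV, TSV, and ARFF outputs keep the column names, while the SVMLight output keeps only the label and the positions of the non-zero features.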

Note

These data conversion modules only convert the complete dataset to a specified format. If you need to do any casting, truncation, conversion of date-time formats, or other manipulation of the values, use the modules in Data Transformation, or see the list of related tasks.

Common data conversion scenarios

You typically use the data conversion modules when you need to move data from an Azure Machine Learning experiment to another machine learning tool or platform. You can also use the modules to export data from Machine Learning in a format that can be used by a database or other tools. For example:

Task | Use this
You need to save an intermediate dataset to use in Excel, or to import to a database. | Use the Convert to CSV module or the Convert to TSV module to prepare the data in the correct format. Then, either download the data or save it to Azure Storage.
You want to reuse data from your experiment in R or Python code. | Use the Convert to CSV module or the Convert to TSV module to prepare the data. Then, right-click the converted dataset to get the Python code that you need to access the dataset.
You are porting your experiment and data between Weka and Azure Machine Learning. | Use the Convert to ARFF module to prepare the data. Then, download the results.
You need to prepare data for use with the SVMlight framework. | Use the Convert to SVMLight module to prepare the data. Then, download the resulting data.
You want to create data to use with Vowpal Wabbit. | Use the SVMLight format. Then, modify the files as described in the article. Save the file in Azure Blob storage to use with a Vowpal Wabbit module in Azure Machine Learning.
Your data is not in a tabular format. | Coerce it to a dataset format by using the Convert to Dataset module.
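
For the scenario of reusing converted data in your own Python code, the following sketch shows one way to read a downloaded CSV or TSV file with the standard library. It assumes that the exported file starts with a header row. The function name `read_delimited` and the sample columns are hypothetical, and this is not the access code that Studio generates for you.

```python
import csv
import io


def read_delimited(text_stream, delimiter=","):
    """Return the rows of a header-first CSV or TSV stream as dictionaries."""
    reader = csv.DictReader(text_stream, delimiter=delimiter)
    return list(reader)


# Stand-in for a downloaded file; in practice, open the file with
# open(path, newline="", encoding="utf-8") and pass the handle instead.
sample_tsv = io.StringIO("age\tincome\n39\t52000\n50\t61000\n")
rows = read_delimited(sample_tsv, delimiter="\t")

# Every value arrives as a string, so cast explicitly where numbers are needed.
ages = [int(row["age"]) for row in rows]
```

Because flat files carry no type information, any casting has to happen either in your own code, as in the last line, or in the experiment before conversion.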

If you need to import data into Azure Machine Learning or transform data in individual columns, use these modules before you perform data conversion:

Task | Use this
Import data from your computer into Azure Machine Learning. | Upload datasets in CSV format as described in Import your training data into Azure Machine Learning Studio.
Import data from a cloud data source, including Hadoop or Azure. | Use the Import Data module.
Save machine learning datasets to Azure Blob storage, a Hadoop cluster, or other cloud-based storage. | Use the Export Data module.
Change the data type of columns or cast columns to a different format or type. | Use the Edit Metadata or Apply SQL Transformation module. If you are proficient with R or Python, try the Execute Python Script or Execute R Script module.
Round, group, or normalize numerical data. | Use the Apply Math Operation, Group Data into Bins, or Normalize Data module.

List of modules

The Data Format Conversions category includes these modules:

  • Convert to ARFF: Converts data input to the Attribute-Relation File Format (ARFF) that's used by the Weka toolset.
  • Convert to CSV: Converts a dataset to a comma-separated values format.
  • Convert to Dataset: Converts data input to the internal dataset format that's used by Azure Machine Learning.
  • Convert to SVMLight: Converts data input to the format that's used by the SVMlight framework.
  • Convert to TSV: Converts data input to a tab-separated values format.

See also