Import data into Azure Machine Learning Studio from various online data sources with the Import Data module

This article describes the support for importing online data from various sources and the information needed to move data from these sources into an Azure Machine Learning experiment.

Note

This article provides general information about the Import Data module. For more detailed information about the types of data you can access, formats, parameters, and answers to common questions, see the module reference topic for the Import Data module.

Introduction

By using the Import Data module, you can access data from one of several online data sources while your experiment is running in Azure Machine Learning Studio:

  • A Web URL using HTTP
  • Hadoop using HiveQL
  • Azure blob storage
  • Azure table
  • Azure SQL database or SQL Server on Azure VM
  • On-premises SQL Server database
  • A data feed provider (currently OData)
  • Azure Cosmos DB

To access online data sources in your Studio experiment, add the Import Data module to your experiment, select the Data source, and then provide the parameters needed to access the data. The table below lists the supported online data sources, along with the file formats each one supports and the parameters used to access the data.

Note that because this training data is accessed while your experiment is running, it's only available in that experiment. By comparison, data that has been stored in a dataset module is available to any experiment in your workspace.

Important

Currently, the Import Data and Export Data modules can read and write data only from Azure storage created using the Classic deployment model. In other words, the new Azure Blob Storage account type that offers a hot storage access tier or cool storage access tier is not yet supported.

Generally, any Azure storage accounts that you created before this service option became available should not be affected. If you need to create a new account, select Classic for the Deployment model, or use Resource Manager and select General purpose rather than Blob storage for Account kind.

For more information, see Azure Blob Storage: Hot and Cool Storage Tiers.

Supported online data sources

The Azure Machine Learning Import Data module supports the following data sources:

Data Source Description Parameters
Web URL via HTTP

Reads data in comma-separated values (CSV), tab-separated values (TSV), attribute-relation file format (ARFF), and Support Vector Machines (SVM-light) formats from any web URL that uses HTTP.

URL: Specifies the full name of the file, including the site URL and the file name, with any extension.

Data format: Specifies one of the supported data formats: CSV, TSV, ARFF, or SVM-light. If the data has a header row, it is used to assign column names.
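As a rough illustration of how a header row becomes column names, the following Python sketch parses CSV or TSV text with the standard library. The function name `parse_delimited`, the sample data, and the `Col1`, `Col2` placeholder naming are assumptions made for this example; it is not the Import Data module's own implementation.

```python
import csv
import io


def parse_delimited(text, data_format="CSV", has_header=True):
    """Parse CSV or TSV text and return (column_names, rows)."""
    delimiter = {"CSV": ",", "TSV": "\t"}[data_format]
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if has_header:
        # The first row supplies the column names.
        return rows[0], rows[1:]
    # No header row: generate placeholder names (naming scheme is an assumption).
    names = [f"Col{i + 1}" for i in range(len(rows[0]))]
    return names, rows


# Hypothetical sample data standing in for a file fetched over HTTP.
sample = "age,income\n39,77516\n50,83311\n"
columns, records = parse_delimited(sample)
print(columns)
print(records[0])
```

To try this against a real file, the text could be fetched first, for example with `urllib.request.urlopen(url).read().decode()`.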
Hadoop/HDFS

Reads data from distributed storage in Hadoop. You specify the data you want by using HiveQL, a SQL-like query language. HiveQL can also be used to aggregate data and perform data filtering before you add the data to Machine Learning Studio.

Hive database query: Specifies the Hive query used to generate the data.

HCatalog server URI: Specifies the name of your cluster, using the format <your cluster name>.azurehdinsight.net.

Hadoop user account name: Specifies the Hadoop user account name used to provision the cluster.

Hadoop user account password: Specifies the credentials used when provisioning the cluster. For more information, see Create Hadoop clusters in HDInsight.

Location of output data: Specifies whether the data is stored in a Hadoop distributed file system (HDFS) or in Azure.
    If you store output data in HDFS, specify the HDFS server URI. (Be sure to use the HDInsight cluster name without the HTTPS:// prefix.)

    If you store your output data in Azure, you must specify the Azure storage account name, the storage access key, and the storage container name.
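The cluster name format described above can be sketched as a small helper. This is a minimal illustration only: the function name `hcatalog_server_uri` and the sample cluster name are hypothetical, and the helper simply normalizes a pasted value into the `<your cluster name>.azurehdinsight.net` form without a scheme prefix.

```python
def hcatalog_server_uri(cluster_name):
    """Return '<cluster name>.azurehdinsight.net' with no http(s):// prefix."""
    name = cluster_name.strip()
    lowered = name.lower()
    # Drop a scheme prefix if the value was pasted from a browser address bar.
    for prefix in ("https://", "http://"):
        if lowered.startswith(prefix):
            name = name[len(prefix):]
            break
    name = name.rstrip("/")
    suffix = ".azurehdinsight.net"
    # Append the HDInsight domain only when it is not already present.
    if not name.lower().endswith(suffix):
        name += suffix
    return name


# 'mycluster' is a hypothetical cluster name.
print(hcatalog_server_uri("mycluster"))
print(hcatalog_server_uri("HTTPS://mycluster.azurehdinsight.net/"))
```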
SQL database

Reads data that is stored in an Azure SQL database or in a SQL Server database running on an Azure virtual machine.

Database server name: Specifies the name of the server on which the database is running.
    For Azure SQL Database, enter the generated server name. Typically it has the form <generated_identifier>.database.windows.net.

    For SQL Server hosted on an Azure virtual machine, enter tcp:<Virtual Machine DNS Name>, 1433

Database name: Specifies the name of the database on the server.

Server user account name: Specifies a user name for an account that has access permissions for the database.

Server user account password: Specifies the password for the user account.

Database query: Enter a SQL statement that describes the data you want to read.
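The two server name formats above can be illustrated with a short Python sketch. The helper names, the sample identifiers, and the sample query are hypothetical; the sketch only assembles the strings in the forms this article describes and does not connect to a database.

```python
def azure_sql_server_name(generated_identifier):
    """Server name for Azure SQL Database: <generated_identifier>.database.windows.net."""
    return f"{generated_identifier}.database.windows.net"


def vm_sql_server_name(vm_dns_name, port=1433):
    """Server name for SQL Server on an Azure VM: tcp:<Virtual Machine DNS Name>, 1433."""
    return f"tcp:{vm_dns_name}, {port}"


# Hypothetical parameter values for the Import Data module.
parameters = {
    "Database server name": azure_sql_server_name("abc123xyz"),
    "Database name": "SalesDb",
    "Database query": "SELECT CustomerId, Total FROM Orders",
}
print(parameters["Database server name"])
print(vm_sql_server_name("myvm.cloudapp.net"))
```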
On-premises SQL database

Reads data that is stored in an on-premises SQL database.

Data gateway: Specifies the name of the Data Management Gateway installed on a computer where it can access your SQL Server database. For information about setting up the gateway, see Perform advanced analytics with Azure Machine Learning using data from an on-premises SQL server.

Database server name: Specifies the name of the server on which the database is running.

Database name: Specifies the name of the database on the server.

Server user account name: Specifies a user name for an account that has access permissions for the database.

User name and password: Click Enter values to enter your database credentials. You can use Windows Integrated Authentication or SQL Server Authentication depending upon how your on-premises SQL Server is configured.

Database query: Enter a SQL statement that describes the data you want to read.
Azure Table

Reads data from the Table service in Azure Storage.

If you read large amounts of data infrequently, use the Azure Table Service. It provides a flexible, non-relational (NoSQL), massively scalable, inexpensive, and highly available storage solution.

The options in the Import Data module change depending on whether you are accessing public information or a private storage account that requires login credentials. This is determined by the Authentication Type, which can be either "PublicOrSAS" or "Account". Each value has its own set of parameters.

Public or Shared Access Signature (SAS) URI: The parameters are:

    Table URI: Specifies the Public or SAS URL for the table.

    Rows to scan for property names: The values are TopN to scan the specified number of rows, or ScanAll to get all rows in the table.

    If the data is homogeneous and predictable, it is recommended that you select TopN and enter a number for N. For large tables, this can result in quicker reading times.

    If the data is structured with sets of properties that vary based on the depth and position of the table, choose the ScanAll option to scan all rows. This ensures the integrity of your resulting property and metadata conversion.

Private Storage Account: The parameters are:

    Account name: Specifies the name of the account that contains the table to read.

    Account key: Specifies the storage key associated with the account.

    Table name: Specifies the name of the table that contains the data to read.

    Rows to scan for property names: The values are TopN to scan the specified number of rows, or ScanAll to get all rows in the table.

    If the data is homogeneous and predictable, we recommend that you select TopN and enter a number for N. For large tables, this can result in quicker reading times.

    If the data is structured with sets of properties that vary based on the depth and position of the table, choose the ScanAll option to scan all rows. This ensures the integrity of your resulting property and metadata conversion.
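The trade-off between TopN and ScanAll can be shown with a small Python model. This is a sketch of the idea only, not the module's implementation: the function `scan_property_names` and the sample entities are hypothetical, and table rows are modeled as plain dictionaries.

```python
def scan_property_names(rows, mode="ScanAll", n=None):
    """Collect property names from the first n rows (TopN) or from every row (ScanAll)."""
    if mode == "TopN":
        if n is None or n < 1:
            raise ValueError("TopN requires a positive row count")
        rows = rows[:n]
    elif mode != "ScanAll":
        raise ValueError(f"unknown mode: {mode}")
    names = []
    for row in rows:
        for key in row:
            # Keep first-seen order and skip names already collected.
            if key not in names:
                names.append(key)
    return names


# Hypothetical entities: the third row carries a property the first two lack.
entities = [
    {"PartitionKey": "p1", "RowKey": "1", "Temp": 21.5},
    {"PartitionKey": "p1", "RowKey": "2", "Temp": 22.0},
    {"PartitionKey": "p2", "RowKey": "1", "Humidity": 40},
]
print(scan_property_names(entities, "TopN", 2))   # misses "Humidity"
print(scan_property_names(entities, "ScanAll"))   # includes "Humidity"
```

With homogeneous rows, both modes return the same names and TopN reads fewer rows. When later rows introduce new properties, as in the sample, only ScanAll finds them.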

Azure Blob Storage

Reads data stored in the Blob service in Azure Storage, including images, unstructured text, or binary data.

You can use the Blob service to publicly expose data, or to privately store application data. You can access your data from anywhere by using HTTP or HTTPS connections.

The options in the Import Data module change depending on whether you are accessing public information or a private storage account that requires login credentials. This is determined by the Authentication Type, which can be either "PublicOrSAS" or "Account".

Public or Shared Access Signature (SAS) URI: The parameters are:

    URI: Specifies the Public or SAS URL for the storage blob.

    File Format: Specifies the format of the data in the Blob service. The supported formats are CSV, TSV, and ARFF.

Private Storage Account: The parameters are:

    Account name: Specifies the name of the account that contains the blob you want to read.

    Account key: Specifies the storage key associated with the account.

    Path to container, directory, or blob: Specifies the name of the blob that contains the data to read.

    Blob file format: Specifies the format of the data in the blob service. The supported data formats are CSV, TSV, ARFF, CSV with a specified encoding, and Excel.

      If the format is CSV or TSV, be sure to indicate whether the file contains a header row.

      You can use the Excel option to read data from Excel workbooks. In the Excel data format option, indicate whether the data is in an Excel worksheet range, or in an Excel table. In the Excel sheet or embedded table option, specify the name of the sheet or table that you want to read from.
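Because the two authentication types take different parameters and accept different formats, it can help to see them side by side. The following Python sketch encodes the parameter names and formats listed above and checks a settings dictionary against them. The function `check_blob_settings` and the sample values are hypothetical; the module performs its own validation, which may differ.

```python
# Parameter names and supported formats as listed in this article.
REQUIRED = {
    "PublicOrSAS": ("URI", "File Format"),
    "Account": (
        "Account name",
        "Account key",
        "Path to container, directory, or blob",
        "Blob file format",
    ),
}
FORMAT_KEY = {"PublicOrSAS": "File Format", "Account": "Blob file format"}
FORMATS = {
    "PublicOrSAS": {"CSV", "TSV", "ARFF"},
    "Account": {"CSV", "TSV", "ARFF", "CSV with a specified encoding", "Excel"},
}


def check_blob_settings(auth_type, settings):
    """Return a list of problems found in the settings (empty if none)."""
    if auth_type not in REQUIRED:
        return [f"unknown authentication type: {auth_type}"]
    problems = [
        f"missing parameter: {name}"
        for name in REQUIRED[auth_type]
        if not settings.get(name)
    ]
    fmt = settings.get(FORMAT_KEY[auth_type])
    if fmt and fmt not in FORMATS[auth_type]:
        problems.append(f"unsupported format for {auth_type}: {fmt}")
    return problems


# Hypothetical settings: Excel is listed only for private storage accounts.
print(check_blob_settings(
    "PublicOrSAS",
    {"URI": "https://example.blob.core.windows.net/data/input.xlsx",
     "File Format": "Excel"},
))
```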

Data Feed Provider

Reads data from a supported feed provider. Currently only the Open Data Protocol (OData) format is supported.

Data content type: Specifies the OData format.

Source URL: Specifies the full URL for the data feed.
For example, the following URL reads from the Northwind sample database: http://services.odata.org/northwind/northwind.svc/

Next steps

Deploying Azure ML web services that use Data Import and Data Export modules