Use Apache Sqoop to import and export data between Hadoop on HDInsight and SQL Database

Learn how to use Apache Sqoop to import and export between a Hadoop cluster in Azure HDInsight and Azure SQL Database or Microsoft SQL Server database. The steps in this document use the sqoop command directly from the headnode of the Hadoop cluster. You use SSH to connect to the head node and run the commands in this document.

Important

The steps in this document only work with HDInsight clusters that use Linux. Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.

Warning

The steps in this document assume that you have already created an Azure SQL Database named sqooptest.

This document provides T-SQL statements that are used to create and query a table in SQL Database. There are many clients that you can use these statements with SQL Database. We recommend the following clients:

Create the table in SQL Database

Important

If you are using the HDInsight cluster and SQL Database created in Create cluster and SQL database, skip the steps in this section. The database and table were created as part of the steps in the Create cluster and SQL database document.

Use a SQL client to connect to the sqooptest database in your SQL Database. Then use the following T-SQL to create a table named mobiledata:

CREATE TABLE [dbo].[mobiledata](
[clientid] [nvarchar](50),
[querytime] [nvarchar](50),
[market] [nvarchar](50),
[deviceplatform] [nvarchar](50),
[devicemake] [nvarchar](50),
[devicemodel] [nvarchar](50),
[state] [nvarchar](50),
[country] [nvarchar](50),
[querydwelltime] [float],
[sessionid] [bigint],
[sessionpagevieworder] [bigint])
GO
CREATE CLUSTERED INDEX mobiledata_clustered_index on mobiledata(clientid)
GO

Sqoop export

  1. Use SSH to connect to the HDInsight cluster. For example, the following command connects to the primary headnode of a cluster named mycluster:

    ssh CLUSTERNAME-ssh.azurehdinsight.net
    

    For more information, see Use SSH with HDInsight.

  2. To verify that Sqoop can see your SQL Database, use the following command:

    sqoop list-databases --connect jdbc:sqlserver://<serverName>.database.windows.net:1433 --username <adminLogin> -P
    

    When prompted, enter the password for the SQL Database login.

    This command returns a list of databases, including the sqooptest database used earlier.

  3. To export data from the Hive hivesampletable table to the mobiledata table in SQL Database, use the following command:

    sqoop export --connect 'jdbc:sqlserver://<serverName>.database.windows.net:1433;database=sqooptest' --username <adminLogin> -P -table 'mobiledata' --hcatalog-table hivesampletable
    
  4. To verify that data was exported, use the following query from your SQL client to view the exported data:

    SET ROWCOUNT 50;
    SELECT * FROM mobiledata;"
    

    This command lists 50 rows that were imported into the table.

Sqoop import

  1. Use the following command to import data from the mobiledata table in SQL Database, to the wasb:///tutorials/usesqoop/importeddata directory on HDInsight:

    sqoop import --connect 'jdbc:sqlserver://<serverName>.database.windows.net:1433;database=sqooptest' --username <adminLogin> --password <adminPassword> --table 'mobiledata' --target-dir 'wasb:///tutorials/usesqoop/importeddata' --fields-terminated-by '\t' --lines-terminated-by '\n' -m 1
    

    The fields in the data are separated by a tab character, and the lines are terminated by a new-line character.

    Important

    The wasb:/// path works with clusters that use Azure Storage as the default cluster storage. For clusters that use Azure Data Lake Store, use adl:/// instead.

  2. Once the import has completed, use the following command to list out the data in the new directory:

    hdfs dfs -text /tutorials/usesqoop/importeddata/part-m-00000
    

Using SQL Server

You can also use Sqoop to import and export data from SQL Server. The differences between using SQL Database and SQL Server are:

  • Both HDInsight and SQL Server must be on the same Azure Virtual Network.

    For an example, see the Connect HDInsight to your on-premises network document.

    For more information on using HDInsight with an Azure Virtual Network, see the Extend HDInsight with Azure Virtual Network document. For more information on Azure Virtual Network, see the Virtual Network Overview document.

  • SQL Server must be configured to allow SQL authentication. For more information, see the Choose an Authentication Mode document.

  • You may have to configure SQL Server to accept remote connections. For more information, see the How to troubleshoot connecting to the SQL Server database engine document.

  • Use the following Transact-SQL statements to create the mobiledata table:

    CREATE TABLE [dbo].[mobiledata](
    [clientid] [nvarchar](50),
    [querytime] [nvarchar](50),
    [market] [nvarchar](50),
    [deviceplatform] [nvarchar](50),
    [devicemake] [nvarchar](50),
    [devicemodel] [nvarchar](50),
    [state] [nvarchar](50),
    [country] [nvarchar](50),
    [querydwelltime] [float],
    [sessionid] [bigint],
    [sessionpagevieworder] [bigint])
    
  • When connecting to the SQL Server from HDInsight, you may have to use the IP address of the SQL Server. For example:

    sqoop import --connect 'jdbc:sqlserver://10.0.1.1:1433;database=sqooptest' --username <adminLogin> -P <adminPassword> -table 'mobiledata' --target-dir 'wasb:///tutorials/usesqoop/importeddata' --fields-terminated-by '\t' --lines-terminated-by '\n' -m 1
    

Limitations

  • Bulk export - With Linux-based HDInsight, the Sqoop connector used to export data to Microsoft SQL Server or Azure SQL Database does not support bulk inserts.

  • Batching - With Linux-based HDInsight, When using the -batch switch when performing inserts, Sqoop makes multiple inserts instead of batching the insert operations.

Next steps

Now you have learned how to use Sqoop. To learn more, see: