Use Apache Sqoop with Hadoop in HDInsight
Learn how to use Apache Sqoop in HDInsight to import and export data between an HDInsight cluster and an Azure SQL database.
Although Apache Hadoop is a natural choice for processing unstructured and semi-structured data, such as logs and files, there may also be a need to process structured data that is stored in relational databases.
Apache Sqoop is a tool designed to transfer data between Hadoop clusters and relational databases. You can use it to import data from a relational database management system (RDBMS) such as SQL Server, MySQL, or Oracle into the Hadoop distributed file system (HDFS), transform the data in Hadoop with MapReduce or Apache Hive, and then export the data back into an RDBMS. In this article, you're using a SQL Server database for your relational database.
This article sets up a test environment to perform the data transfer. You then choose a data transfer method for this environment from one of the methods in section Run Sqoop jobs, further below.
For Sqoop versions that are supported on HDInsight clusters, see What's new in the cluster versions provided by HDInsight?
Understand the scenario
HDInsight cluster comes with some sample data. You use the following two samples:
- An Apache Log4j log file, which is located at
/example/data/sample.log. The following logs are extracted from the file:
2012-02-03 18:35:34 SampleClass6 [INFO] everything normal for id 577725851 2012-02-03 18:35:34 SampleClass4 [FATAL] system problem at id 1991281254 2012-02-03 18:35:34 SampleClass3 [DEBUG] detail for id 1304807656 ...
A Hive table named
hivesampletable, which references the data file located at
/hive/warehouse/hivesampletable. The table contains some mobile device data.
Field Data type clientid string querytime string market string deviceplatform string devicemake string devicemodel string state string country string querydwelltime double sessionid bigint sessionpagevieworder bigint
In this article, you use these two datasets to test Sqoop import and export.
Set up test environment
The cluster, SQL database, and other objects are created through the Azure portal using an Azure Resource Manager template. The template can be found in Azure quickstart templates. The Resource Manager template calls a bacpac package to deploy the table schemas to a SQL database. The bacpac package is located in a public blob container, https://hditutorialdata.blob.core.windows.net/usesqoop/SqoopTutorial-2016-2-23-11-2.bacpac. If you want to use a private container for the bacpac files, use the following values in the template:
"storageKeyType": "Primary", "storageKey": "<TheAzureStorageAccountKey>",
Import using a template or the Azure portal only supports importing a BACPAC file from Azure blob storage.
Select the following image to open the Resource Manager template in the Azure portal.
Enter the following properties:
Field Value Subscription Select your Azure subscription from the drop-down list. Resource group Select your resource group from the drop-down list, or create a new one Location Select a region from the drop-down list. Cluster Name Enter a name for the Hadoop cluster. Use lowercase letter only. Cluster Login User Name Keep the pre-populated value
Cluster Login Password Enter a password. Ssh User Name Keep the pre-populated value
Ssh Password Enter a password. Sql Admin Login Keep the pre-populated value
Sql Admin Password Enter a password. _artifacts Location Use the default value unless you want to use your own bacpac file in a different location. _artifacts Location Sas Token Leave blank. Bacpac File Name Use the default value unless you want to use your own bacpac file. Location Use the default value.
The Azure SQL Server name will be
<ClusterName>dbserver. The database name will be
<ClusterName>db. The default storage account name will be
Select I agree to the terms and conditions stated above.
Select Purchase. You see a new tile titled Submitting deployment for Template deployment. It takes about around 20 minutes to create the cluster and SQL database.
Run Sqoop jobs
HDInsight can run Sqoop jobs by using a variety of methods. Use the following table to decide which method is right for you, then follow the link for a walkthrough.
|Use this if you want...||...an interactive shell||...batch processing||...from this client operating system|
|SSH||?||?||Linux, Unix, Mac OS X, or Windows|
|.NET SDK for Hadoop||?||Windows (for now)|
- Bulk export - With Linux-based HDInsight, the Sqoop connector used to export data to Microsoft SQL Server or Azure SQL Database doesn't currently support bulk inserts.
- Batching - With Linux-based HDInsight, When using the
-batchswitch when performing inserts, Sqoop performs multiple inserts instead of batching the insert operations.
Now you've learned how to use Sqoop. To learn more, see:
- Use Apache Hive with HDInsight
- Upload data to HDInsight: Find other methods for uploading data to HDInsight/Azure Blob storage.
- Use Apache Sqoop to import and export data between Apache Hadoop on HDInsight and SQL Database