Connect Excel to Apache Hadoop by using Power Query
One key feature of the Microsoft big-data solution is the integration of Microsoft business intelligence (BI) components with Apache Hadoop clusters in Azure HDInsight. A primary example is the ability to connect Excel to the Azure Storage account that contains the data associated with your Hadoop cluster by using the Microsoft Power Query for Excel add-in. This article walks you through how to set up and use Power Query to query data associated with a Hadoop cluster managed with HDInsight.
Before you begin this article, you must have the following items:
- An HDInsight cluster. To configure one, see [Get started with Azure HDInsight][hdinsight-get-started].
- A workstation that is running Windows 7, Windows Server 2008 R2, or a later operating system.
- Office 2016, Office 2013 Professional Plus, Office 365 ProPlus, Excel 2013 Standalone, or Office 2010 Professional Plus.
Install Power Query
Power Query can import data that has been output or generated by a Hadoop job running on an HDInsight cluster.
In Excel 2016, Power Query has been integrated into the Data ribbon under the Get & Transform section. For older Excel versions, download Microsoft Power Query for Excel from the Microsoft Download Center and install it.
Import HDInsight data into Excel
The Power Query add-in for Excel makes it easy to import data from your HDInsight cluster into Excel, where BI tools such as PowerPivot and Power Map can be used to inspect, analyze, and present the data.
To import data from an HDInsight cluster
Create a new blank workbook.
Perform the following steps based on the Excel version:
- Excel 2016: Click the Data menu, click Get Data in the Get & Transform Data section of the ribbon, click From Azure, and then click From Azure HDInsight (HDFS).
- Excel 2013 and Excel 2010: Click the Power Query menu, click From Azure, and then click From Microsoft Azure HDInsight.
Note: If you don't see the Power Query menu, go to File > Options > Add-ins, and select COM Add-ins from the drop-down Manage box at the bottom of the page. Select the Go... button and verify that the box for the Power Query for Excel add-in has been checked.
Note: Power Query also allows you to import data from HDFS by clicking From Other Sources.
For Account Name, enter the name of the Azure Blob storage account associated with your cluster, and then click OK. This account can be the default storage account or a linked storage account. The format is https://<StorageAccountName>.blob.core.windows.net/.
For Account Key, enter the key for the Blob storage account, and then click Save. (You need to enter the account information only the first time you access this store.)
In the Navigator pane on the left of the Query Editor, double-click the Blob storage container name. By default, the container name is the same as the cluster name.
Locate HiveSampleData.txt in the Name column (the folder path is ../hive/warehouse/hivesampletable/), and then click Binary on the left of HiveSampleData.txt. HiveSampleData.txt is included with every cluster. Optionally, you can use your own file.
If you want, you can rename the columns. When you are ready, click Close & Load. The data is then loaded into your workbook.
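For readers who want to see the same flow outside the Excel UI, the following Python sketch mirrors two parts of the procedure: building the storage account URL in the format given above, and turning delimited text into rows with renamed columns. It is only an illustration under stated assumptions. The account name `mystorageaccount`, the helper names, the column names, the sample rows, and the tab delimiter are all hypothetical placeholders; they are not taken from HiveSampleData.txt, and the sketch does not connect to Azure.

```python
import csv
import io
from typing import Dict, List


def blob_account_url(storage_account_name: str) -> str:
    """Build the account URL in the format the article gives:
    https://<StorageAccountName>.blob.core.windows.net/
    """
    return f"https://{storage_account_name}.blob.core.windows.net/"


def load_delimited(
    text: str, column_names: List[str], delimiter: str = "\t"
) -> List[Dict[str, str]]:
    """Parse delimited text and attach friendly column names to each row.

    This loosely imitates renaming columns before Close & Load.
    The tab delimiter is an assumption; adjust it to match your file.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(column_names):
            raise ValueError(
                f"expected {len(column_names)} fields, got {len(fields)}"
            )
        rows.append(dict(zip(column_names, fields)))
    return rows


if __name__ == "__main__":
    # Hypothetical account name, used only to show the URL format.
    print(blob_account_url("mystorageaccount"))

    # Hypothetical sample rows and column names, not real cluster data.
    sample_text = "1\ten-US\tAndroid\n2\ten-GB\tiOS\n"
    for row in load_delimited(sample_text, ["id", "market", "platform"]):
        print(row)
```

Keeping the column names as a parameter reflects the optional rename step: the parsing stays the same whether you keep default names or supply your own.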
In this article, you learned how to use Power Query to retrieve data from HDInsight into Excel. Similarly, you can retrieve data from HDInsight into Azure SQL Database. It is also possible to upload data into HDInsight. To learn more, see the following articles:
- Visualize Apache Hive data with Microsoft Power BI in Azure HDInsight.
- Visualize Interactive Query Hive data with Power BI in Azure HDInsight.
- Use Apache Zeppelin to run Apache Hive queries in Azure HDInsight.
- Connect Excel to HDInsight with the Microsoft Hive ODBC Driver.
- Connect to Azure HDInsight and run Apache Hive queries using Data Lake Tools for Visual Studio.
- Use Azure HDInsight Tool for Visual Studio Code.
- Upload data to HDInsight.