Apache Spark BI using data visualization tools with Azure HDInsight
Learn how to use data visualization tools such as Power BI and Tableau to analyze a raw sample data set using Apache Spark BI on HDInsight clusters.
Connectivity with BI tools described in this article is not supported on Spark 2.1 on Azure HDInsight 3.6 Preview. Only Spark versions 1.6 and 2.0 (HDInsight 3.4, 3.5 respectively) are supported.
This tutorial is also available as a Jupyter notebook on an HDInsight Spark cluster. The notebook experience lets you run the Python snippets from the notebook itself. To perform the tutorial from within a notebook, create a Spark cluster, launch a Jupyter notebook (
https://CLUSTERNAME.azurehdinsight.net/jupyter), and then run the notebook Use BI tools with Apache Spark on HDInsight.ipynb under the Python folder.
- An Apache Spark cluster on HDInsight. For instructions, see Create Apache Spark clusters in Azure HDInsight.
Prepare data for Spark data visualization
In this section, we use the Jupyter notebook from an HDInsight Spark cluster to run jobs that process your raw sample data and save it as a table. The sample data is a .csv file (hvac.csv) available on all clusters by default. Once your data is saved as a table, in the next section we use BI tools to connect to the table and perform data visualizations.
If you are performing the steps in this article after completing the instructions in Run interactive queries on an HDInsight Spark cluster, you can skip to Step 8 below.
From the Azure portal, from the startboard, click the tile for your Spark cluster (if you pinned it to the startboard). You can also navigate to your cluster under Browse All > HDInsight Clusters.
From the Spark cluster blade, click Cluster Dashboard, and then click Jupyter Notebook. If prompted, enter the admin credentials for the cluster.
You may also reach the Jupyter Notebook for your cluster by opening the following URL in your browser. Replace CLUSTERNAME with the name of your cluster:
Create a notebook. Click New, and then click PySpark.
A new notebook is created and opened with the name Untitled.pynb. Click the notebook name at the top, and enter a friendly name.
Because you created a notebook using the PySpark kernel, you do not need to create any contexts explicitly. The Spark and Hive contexts are automatically created for you when you run the first code cell. You can start by importing the types required for this scenario. To do so, place the cursor in the cell and press SHIFT + ENTER.
from pyspark.sql import *
Load sample data into a temporary table. When you create a Spark cluster in HDInsight, the sample data file, hvac.csv, is copied to the associated storage account under \HdiSamples\HdiSamples\SensorSampleData\hvac.
In an empty cell, paste the following snippet and press SHIFT + ENTER. This snippet registers the data into a table called hvac.
# Create an RDD from sample data hvacText = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv") # Create a schema for our data Entry = Row('Date', 'Time', 'TargetTemp', 'ActualTemp', 'BuildingID') # Parse the data and create a schema hvacParts = hvacText.map(lambda s: s.split(',')).filter(lambda s: s != 'Date') hvac = hvacParts.map(lambda p: Entry(str(p), str(p), int(p), int(p), int(p))) # Infer the schema and create a table hvacTable = sqlContext.createDataFrame(hvac) hvacTable.registerTempTable('hvactemptable') dfw = DataFrameWriter(hvacTable) dfw.saveAsTable('hvac')
Verify that the table was successfully created. You can use the
%%sqlmagic to run Hive queries directly. For more information about the
%%sqlmagic, and other magics available with the PySpark kernel, see Kernels available on Jupyter notebooks with Spark HDInsight clusters.
%%sql SHOW TABLES
You see an output like shown below:
+---------------+-------------+ |tableName |isTemporary | +---------------+-------------+ |hvactemptable |true | |hivesampletable|false | |hvac |false | +---------------+-------------+
Only the tables that have false under the isTemporary column are hive tables that are stored in the metastore and can be accessed from the BI tools. In this tutorial, we connect to the hvac table we created.
Verify that the table contains the intended data. In an empty cell in the notebook, copy the following snippet and press SHIFT + ENTER.
%%sql SELECT * FROM hvac LIMIT 10
Shut down the notebook to release the resources. To do so, from the File menu on the notebook, click Close and Halt.
Use Power BI for Spark data visualization
This section is applicable only for Spark 1.6 on HDInsight 3.4 and Spark 2.0 on HDInsight 3.5.
Once you have saved the data as a table, you can use Power BI to connect to the data and visualize it to create reports, dashboards, etc.
Make sure you have access to Power BI. You can get a free preview subscription of Power BI from http://www.powerbi.com/.
Sign in to Power BI.
From the bottom of the left pane, click Get Data.
On the Get Data page, under Import or Connect to Data, for Databases, click Get.
On the next screen, click Spark on Azure HDInsight and then click Connect. When prompted, enter the cluster URL (
mysparkcluster.azurehdinsight.net) and the credentials to connect to the cluster.
After the connection is established, Power BI starts importing data from the Spark cluster on HDInsight.
Power BI imports the data and adds a Spark dataset under the Datasets heading. Click the data set to open a new worksheet to visualize the data. You can also save the worksheet as a report. To save a worksheet, from the File menu, click Save.
Notice that the Fields list on the right lists the hvac table you created earlier. Expand the table to see the fields in the table, as you defined in notebook earlier.
Build a visualization to show the variance between target temperature and actual temperature for each building. To visualize your data, select Area Chart (shown in red box). To define the axis, drag-and-drop the BuildingID field under Axis, and ActualTemp/TargetTemp fields under Value.
By default the visualization shows the sum for ActualTemp and TargetTemp. For both the fields, from the drop-down, select Average to get an average of actual and target temperatures for both buildings.
Your data visualization should be similar to the one in the screenshot. Move your cursor over the visualization to get tool tips with relevant data.
Click Save from the top menu and provide a report name. You can also pin the visual. When you pin a visualization, it is stored on your dashboard so you can track the latest value at a glance.
You can add as many visualizations as you want for the same dataset and pin them to the dashboard for a snapshot of your data. Also, Spark clusters on HDInsight are connected to Power BI with direct connect. This ensures that Power BI always has the most up-to-date data from your cluster so you do not need to schedule refreshes for the dataset.
Use Tableau Desktop for Spark data visualization
This section is applicable only for Spark 1.5.2 clusters created in Azure HDInsight.
Install Tableau Desktop on the computer where you are running this Apache Spark BI tutorial.
Make sure that computer also has Microsoft Spark ODBC driver installed. You can install the driver from here.
Launch Tableau Desktop. In the left pane, from the list of server to connect to, click Spark SQL. If Spark SQL is not listed by default in the left pane, you can find it by click More Servers.
In the Spark SQL connection dialog box, provide the values as shown in the screenshot, and then click OK.
The authentication drop-down lists Microsoft Azure HDInsight Service as an option, only if you installed the Microsoft Spark ODBC Driver on the computer.
On the next screen, from the Schema drop-down, click the Find icon, and then click default.
For the Table field, click the Find icon again to list all the Hive tables available in the cluster. You should see the hvac table you created earlier using the notebook.
Drag and drop the table to the top box on the right. Tableau imports the data and displays the schema as highlighted by the red box.
Click the Sheet1 tab at the bottom left. Make a visualization that shows the average target and actual temperatures for all buildings for each date. Drag Date and Building ID to Columns and Actual Temp/Target Temp to Rows. Under Marks, select Area to use an area map for Spark data visualization.
By default, the temperature fields are shown as aggregate. If you want to show the average temperatures instead, you can do so from the drop-down, as shown below.
You can also super-impose one temperature map over the other to get a better feel of difference between target and actual temperatures. Move the mouse to the corner of the lower area map till you see the handle shape highlighted in a red circle. Drag the map to the other map on the top and release the mouse when you see the shape highlighted in red rectangle.
Your data visualization should change as shown in the screenshot:
- Click Save to save the worksheet. You can create dashboards and add one or more sheets to it.
So far you learned how to create a cluster, create Spark data frames to query data, and then access that data from BI tools. You can now look at instructions on how to manage the cluster resources and debug jobs that are running in an HDInsight Spark cluster.