Run interactive queries on Spark clusters in HDInsight
Learn how to use Jupyter notebook to run interactive Spark SQL queries against a Spark cluster.
Jupyter notebook is a browser-based application that extends the console-based interactive experience to the Web. Spark on HDInsight also includes Zeppelin Notebook. Jupyter Notebook is used in this tutorial.
Jupyter notebooks on HDInsight clusters support three kernels - PySpark, PySpark3, and Spark. The PySpark kernel is used in this tutorial. For more information about the kernels, and the benefits of using PySpark, see Use Jupyter notebook kernels with Apache Spark clusters in HDInsight. To use Zeppelin Notebook, see Use Zeppelin notebooks with Apache Spark cluster on Azure HDInsight.
In this tutorial, you query data in a csv file. You must first load that data into Spark as a dataframe. Then you can run queries on the dataframe using Jupyter Notebook.
- An Azure HDInsight Spark cluster. For the instructions, see Create an Apache Spark cluster in Azure HDInsight.
- A Jupyter notebook using PySpark. For the instructions, see Create a Jupyter Notebook.
Create a dataframe from a csv file
With a SQLContext, applications can create dataframes from an existing RDD, from a Hive table, or from data sources.
To create a dataframe from a csv file
Create a Jupyter notebook using PySpark if you don't have one. For the instructions, see Create a Jupyter Notebook.
Paste the following code in an empty cell of the notebook, and then press SHIFT + ENTER to run the code. The code imports the types required for this scenario:
from pyspark.sql import * from pyspark.sql.types import *
By using the PySpark kernel to create a notebook, the Spark, and Hive contexts are automatically created for you when you run the first code cell. You do not need to explicitly create any contexts.
When running an interactive query in Jupyter, the web browser window or tab caption shows a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark text in the top-right corner. After the job is completed, it changes to a hollow circle.
Run the following code to create a dataframe and a temporary table (hvac) by running the following code: The code doesn't extract all the columns available in the CSV file.
# Create an RDD from sample data hvacText = sc.textFile("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv") # Create a schema for our data Entry = Row('Date', 'Time', 'TargetTemp', 'ActualTemp', 'BuildingID') # Parse the data and create a schema hvacParts = hvacText.map(lambda s: s.split(',')).filter(lambda s: s != 'Date') hvac = hvacParts.map(lambda p: Entry(str(p), str(p), int(p), int(p), int(p))) # Infer the schema and create a table hvacTable = sqlContext.createDataFrame(hvac) hvacTable.registerTempTable('hvactemptable') dfw = DataFrameWriter(hvacTable) dfw.saveAsTable('hvac')
The following screenshot shows a snapshot of the HVAC.csv file. The csv file comes with all HDInsigt Spark clusters. The data captures the temperature variations of a building.
Run queries on the dataframe
Once the table is created, you can run an interactive query on the data.
To run a query
Run the following code in an empty cell of the notebook:
%%sql SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date FROM hvac WHERE date = \"6/1/13\"
Because the PySpark kernel is used in the notebook, you can now directly run an interactive SQL query on the temporary table hvac that you created by using the
%%sqlmagic. For more information about the
%%sqlmagic, and other magics available with the PySpark kernel, see Kernels available on Jupyter notebooks with Spark HDInsight clusters.
The following tabular output is displayed by default.
You can also see the results in other visualizations as well. To see an area graph for the same output, select Area then set other values as shown.
From the File menu on the notebook, click Save and Checkpoint.
If you're starting the next tutorial now, leave the notebook open. If not, shut down the notebook to release the cluster resources: from the File menu on the notebook, click Close and Halt.
In this article, you learned how to run interactive queries in Spark using Jupyter notebook. Advance to the next article to see how the data you registered in Spark can be pulled into a BI analytics tool such as Power BI and Tableau.