- An Azure subscription. You must have one before you begin this tutorial. See Create your free Azure account today.
Create HDInsight Spark cluster
Click the following image to open the template in the Azure portal.
Enter the following values:
- Subscription: Select your Azure subscription for this cluster.
- Resource group: Create a resource group or select an existing one. A resource group is used to manage Azure resources for your projects.
- Location: Select a location for the resource group. This location is also used for the default cluster storage and the HDInsight cluster.
- ClusterName: Enter a name for the Spark cluster that you create.
- Spark version: Select the Spark version that you want to install on the cluster.
- Cluster login name and password: The default login name is admin.
- SSH user name and password.
Write down these values. You need them later in the tutorial.
Select I agree to the terms and conditions stated above, select Pin to dashboard, and then click Purchase. You can see a new tile titled Submitting deployment for Template deployment. It takes about 20 minutes to create the cluster.
This article creates a Spark cluster that uses Azure Storage Blobs as the cluster storage. You can also create a Spark cluster that uses Azure Data Lake Store as additional storage, in addition to Azure Storage Blobs as the default storage. For instructions, see Create an HDInsight cluster with Data Lake Store.
Run an interactive Spark SQL query
In this section, you use Jupyter notebook to run interactive Spark SQL queries against the Spark cluster you created earlier. HDInsight Spark clusters provide three kernels that you can use with the Jupyter notebook. These are:
- PySpark (for applications written in Python)
- PySpark3 (for applications written in Python3)
- Spark (for applications written in Scala)
In this article, you use the PySpark kernel in the notebook from where you run the interactive Spark SQL query. For more information about the kernels, see Use Jupyter notebook kernels with Apache Spark clusters in HDInsight. Some of the key benefits of using the PySpark kernel are:
- The contexts for Spark and Hive are set automatically.
- Use cell magics, such as %%sql, to directly run interactive SQL or Hive queries, without any preceding code snippets.
- The output from the interactive queries is automatically visualized.
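As an illustration of the cell-magic benefit above, a notebook cell needs nothing more than the magic on its first line, followed by the query. This is a sketch only; the table name here is hypothetical:

```sql
%%sql
SELECT * FROM sometable LIMIT 10
```

The kernel routes the query through the preconfigured Spark SQL context, so no setup code appears in the cell.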
Create Jupyter notebook with PySpark kernel
Open the Azure portal.
If you opted to pin the cluster to the dashboard, click the cluster tile from the dashboard to launch the cluster blade.
If you did not pin the cluster to the dashboard, from the left pane, click HDInsight clusters, and then click the cluster you created.
From Quick links, click Cluster dashboards, and then click Jupyter Notebook. If prompted, enter the admin credentials for the cluster.
You may also access the Jupyter notebook for your cluster by opening the following URL in your browser. Replace CLUSTERNAME with the name of your cluster:
Create a notebook. Click New, and then click PySpark.
A new notebook is created and opened with the name Untitled (Untitled.ipynb).
Click the notebook name at the top, and enter a friendly name if you want.
Paste the following code in an empty cell, and then press SHIFT + ENTER to run the code. The code imports the types required for this scenario:
from pyspark.sql.types import *
Because you created a notebook using the PySpark kernel, you do not need to create any contexts explicitly. The Spark and Hive contexts are automatically created for you when you run the first code cell.
Every time you run an interactive query in Jupyter, your web browser window title shows a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark text in the top-right corner. After the job is completed, it changes to a hollow circle.
Register a sample data set as a temporary table (hvac) by running the following code.
# Load the data
hvacText = sc.textFile("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")

# Create the schema
hvacSchema = StructType([StructField("date", StringType(), False),
                         StructField("time", StringType(), False),
                         StructField("targettemp", IntegerType(), False),
                         StructField("actualtemp", IntegerType(), False),
                         StructField("buildingID", StringType(), False)])

# Parse the data in hvacText: skip the header row, then pick out the
# date, time, targettemp, actualtemp, and buildingID columns
hvac = hvacText.map(lambda s: s.split(",")) \
    .filter(lambda s: s[0] != "Date") \
    .map(lambda s: (str(s[0]), str(s[1]), int(s[2]), int(s[3]), str(s[6])))

# Create a data frame
hvacdf = sqlContext.createDataFrame(hvac, hvacSchema)

# Register the data frame as a table to run queries against
hvacdf.registerTempTable("hvac")
Spark clusters in HDInsight come with a sample data file, HVAC.csv, under /HdiSamples/HdiSamples/SensorSampleData/hvac.
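To see what the parsing step does to each line of the file, here is a minimal plain-Python sketch of the same split/filter/cast logic, without Spark. The sample rows are illustrative, not taken from the real file, but they follow the HVAC.csv column layout (Date, Time, TargetTemp, ActualTemp, System, SystemAge, BuildingID):

```python
# Hypothetical sample lines in the HVAC.csv layout; the first line is the header.
lines = [
    "Date,Time,TargetTemp,ActualTemp,System,SystemAge,BuildingID",
    "6/1/13,0:00:01,66,58,13,20,4",
    "6/2/13,1:00:01,69,68,3,20,17",
]

# Split each line on commas, drop the header row, and keep
# (date, time, targettemp, actualtemp, buildingID) with the right types.
rows = [
    (str(s[0]), str(s[1]), int(s[2]), int(s[3]), str(s[6]))
    for s in (line.split(",") for line in lines)
    if s[0] != "Date"
]

print(rows)
# [('6/1/13', '0:00:01', 66, 58, '4'), ('6/2/13', '1:00:01', 69, 68, '17')]
```

On the cluster, Spark applies the same per-line transformation in parallel across the RDD's partitions.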
To run interactive query on the data, use the following code.
%%sql
SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date FROM hvac WHERE date = "6/1/13"
Because you are using the PySpark kernel, you can directly run an interactive SQL query on the temporary table hvac that you created by using the %%sql magic. For more information about the %%sql magic and other magics available with the PySpark kernel, see Kernels available on Jupyter notebooks with Spark HDInsight clusters.
The following tabular output is displayed by default.
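To make explicit what the query computes, here is a plain-Python equivalent of the temp_diff calculation. The rows below are illustrative values, not the actual query output:

```python
# Illustrative rows in (buildingID, targettemp, actualtemp, date) form.
rows = [
    ("4", 66, 58, "6/1/13"),
    ("17", 69, 68, "6/2/13"),
    ("18", 67, 70, "6/1/13"),
]

# Equivalent of:
#   SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date
#   FROM hvac WHERE date = "6/1/13"
result = [
    (buildingID, targettemp - actualtemp, date)
    for (buildingID, targettemp, actualtemp, date) in rows
    if date == "6/1/13"
]

print(result)  # [('4', 8, '6/1/13'), ('18', -3, '6/1/13')]
```

A negative temp_diff means the actual temperature exceeded the target for that reading.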
You can also view the results in other visualizations. For example, an area graph for the same output looks like the following.
Shut down the notebook to release the cluster resources after you have finished running the application. To do so, from the File menu on the notebook, click Close and Halt.
Delete the cluster
Billing for HDInsight clusters is prorated per minute, whether you are using them or not. Be sure to delete your cluster after you have finished using it. For more information, see How to delete an HDInsight cluster.
Troubleshoot access control
If you run into an issue with creating HDInsight clusters, see access control requirements.
- Spark with BI: Perform interactive data analysis using Spark in HDInsight with BI tools
- Spark with Machine Learning: Use Spark in HDInsight for analyzing building temperature using HVAC data
- Spark with Machine Learning: Use Spark in HDInsight to predict food inspection results
- Spark Streaming: Use Spark in HDInsight for building real-time streaming applications
- Website log analysis using Spark in HDInsight
- Application Insight telemetry data analysis using Spark in HDInsight
Create and run applications
Tools and extensions
- Use HDInsight Tools Plugin for IntelliJ IDEA to create and submit Spark Scala applications
- Use HDInsight Tools Plugin for IntelliJ IDEA to debug Spark applications remotely
- Use Zeppelin notebooks with a Spark cluster on HDInsight
- Kernels available for Jupyter notebook in Spark cluster for HDInsight
- Use external packages with Jupyter notebooks
- Install Jupyter on your computer and connect to an HDInsight Spark cluster