Create an Apache Spark cluster in Azure HDInsight
Learn how to create an Apache Spark cluster on Azure HDInsight, and how to run Spark SQL queries against Hive tables. For information on Spark on HDInsight, see Overview: Apache Spark on Azure HDInsight.
Prerequisites
- An Azure subscription. If you don't have one, see Create your free Azure account.
Create HDInsight Spark cluster
Click the following image to open the template in the Azure portal.
Enter the following values:
- Subscription: Select your Azure subscription used for creating this cluster.
- Resource group: Create a resource group or select an existing one. A resource group is used to manage Azure resources for your projects.
- Location: Select a location for the resource group. The template uses this location for creating the cluster as well as for the default cluster storage.
- ClusterName: Enter a name for the HDInsight cluster that you want to create.
- Cluster login name and password: The default login name is admin.
- SSH user name and password.
Select I agree to the terms and conditions stated above, select Pin to dashboard, and then click Purchase. You can see a new tile titled Deploying Template deployment. It takes about 20 minutes to create the cluster.
If you run into an issue with creating HDInsight clusters, it could be that you do not have the right permissions to do so. For more information, see Access control requirements.
This article creates a Spark cluster that uses Azure Storage Blobs as the cluster storage. You can also create a Spark cluster that uses Azure Data Lake Store as the default storage. For instructions, see Create an HDInsight cluster with Data Lake Store.
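If you prefer to keep the template values in a file instead of typing them into the portal form, they can be collected in an Azure Resource Manager parameters file. The following is a minimal sketch, not the template's confirmed interface: the parameter names (clusterName, clusterLoginUserName, clusterLoginPassword, sshUserName, sshPassword), the helper name build_parameters, and the sample values are assumptions for illustration. Check the template you deploy for the names it actually declares.

```python
import json


def build_parameters(cluster_name, login_password, ssh_user, ssh_password,
                     login_user="admin"):
    """Return a dict in the Resource Manager parameters-file layout.

    The parameter names below are assumptions, not taken from the template.
    """
    values = {
        "clusterName": cluster_name,
        "clusterLoginUserName": login_user,  # the article's default login name
        "clusterLoginPassword": login_password,
        "sshUserName": ssh_user,
        "sshPassword": ssh_password,
    }
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        # Each parameter is wrapped as {"value": ...} in a parameters file.
        "parameters": {name: {"value": value} for name, value in values.items()},
    }


# Placeholder values only; never commit real passwords to a file.
params = build_parameters("myspark", "<login-password>", "sshuser", "<ssh-password>")
print(json.dumps(params, indent=2))
```

The subscription, resource group, and location are not template parameters in this sketch, because they are chosen when the deployment itself is started.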
Create a Jupyter notebook
Jupyter Notebook is an interactive notebook environment that supports various programming languages. It lets you interact with your data, combine code with Markdown text, and perform simple visualizations. Spark on HDInsight also includes Zeppelin Notebook. Jupyter Notebook is used in this tutorial.
To create a Jupyter notebook
Open the Azure portal.
Open the Spark cluster you created. For the instructions, see List and show clusters.
From the portal, click Cluster dashboards, and then click Jupyter Notebook. If prompted, enter the admin credentials for the cluster.
You may also access the Jupyter Notebook for your cluster by opening the following URL in your browser. Replace CLUSTERNAME with the name of your cluster:
Click New, and then click PySpark to create a notebook. Jupyter notebooks on HDInsight clusters support three kernels - PySpark, PySpark3, and Spark. The PySpark kernel is used in this tutorial. For more information about the kernels, and the benefits of using PySpark, see Use Jupyter notebook kernels with Apache Spark clusters in HDInsight.
A new notebook is created and opened with the name Untitled (Untitled.ipynb).
Click the notebook name at the top, and enter a friendly name if you want.
Run Spark SQL statements on a Hive table
SQL (Structured Query Language) is the most common and widely used language for querying and defining data. Spark SQL functions as an extension to Apache Spark for processing structured data, using the familiar SQL syntax.
Spark SQL supports both SQL and HiveQL as query languages. It provides bindings for Python, Scala, and Java. With it, you can query data stored in many locations, such as external databases, structured data files (for example, JSON), and Hive tables.
For an example of reading data from a CSV file instead of a Hive table, see Run interactive queries on Spark clusters in HDInsight.
To run Spark SQL
When you start the notebook for the first time, the kernel performs some tasks in the background. Wait for the kernel to be ready. The kernel is ready when you see a hollow circle next to the kernel name in the notebook. A solid circle denotes that the kernel is busy.
When the kernel is ready, paste the following code in an empty cell, and then press SHIFT + ENTER to run the code. The output should list hivesampletable, a table that is available on the cluster by default.
%%sql
SHOW TABLES
When you use a Jupyter Notebook with your HDInsight Spark cluster, you get a preset sqlContext that you can use to run Hive queries using Spark SQL. The %%sql magic tells Jupyter Notebook to use the preset sqlContext to run the Hive query. The query lists the tables on the cluster, including hivesampletable, a Hive table that comes with all HDInsight clusters by default. For more information on the %%sql magic and the preset contexts, see Jupyter kernels available for an HDInsight cluster.
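The preset sqlContext can also be called directly from a regular PySpark cell, without the %%sql magic. The following is a minimal sketch of that idea: the helper name table_names is hypothetical, and it assumes that the result of SHOW TABLES exposes a tableName column. The guard at the end only runs when a sqlContext is already defined, as it is in the notebook.

```python
def table_names(context):
    """List table names through a context that exposes .sql(),
    such as the notebook's preset sqlContext."""
    rows = context.sql("SHOW TABLES").collect()
    # Assumption: each result row has a tableName field.
    return [row.tableName for row in rows]


# In the notebook, sqlContext is preset; elsewhere this guard skips the call.
if "sqlContext" in globals():
    print(table_names(sqlContext))
```

Passing the context as an argument keeps the helper independent of the notebook, so it can be exercised with a stand-in object outside the cluster.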
Every time you run a query in Jupyter, your web browser window title shows a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark text in the top-right corner.
Run another query to see the data in hivesampletable. The query retrieves the top 10 rows from the table.
%%sql
SELECT * FROM hivesampletable LIMIT 10
The screen refreshes to show the query output.
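A 10-row preview of hivesampletable can also be requested through the DataFrame API instead of SQL text. This is a sketch under the assumption that the preset sqlContext is available in the cell; the helper name first_rows is hypothetical.

```python
def first_rows(context, table="hivesampletable", n=10):
    """Return a DataFrame holding at most n rows of the named table."""
    # table() loads the table as a DataFrame; limit() caps the row count.
    return context.table(table).limit(n)


# In the notebook, sqlContext is preset; elsewhere this guard skips the call.
if "sqlContext" in globals():
    first_rows(sqlContext).show()
```

The DataFrame form is convenient when the preview feeds further Python code, while the %%sql form is shorter for ad hoc inspection.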
From the File menu on the notebook, click Close and Halt. Shutting down the notebook releases the cluster resources.
If you plan to complete the next steps at a later time, make sure you delete the HDInsight cluster you created in this article.
Billing for HDInsight clusters is prorated per minute, whether you are using them or not. Be sure to delete your cluster after you have finished using it. For more information, see How to delete an HDInsight cluster.
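To see what per-minute proration means in practice, the arithmetic can be sketched as follows. The hourly rate used here is a made-up figure for illustration, not an HDInsight price, and the sketch assumes a single combined hourly rate for the whole cluster.

```python
def prorated_cost(hourly_rate, minutes):
    """Cost of keeping a cluster for the given number of minutes,
    assuming one combined hourly rate billed per minute."""
    return hourly_rate * minutes / 60.0


# Hypothetical rate of 3.00 per hour: 20 minutes of creation time
# plus 40 minutes of idle time is billed as a full hour.
print(prorated_cost(3.00, 20 + 40))
```

Idle minutes count the same as busy ones in this calculation, which is why deleting the cluster when you are done matters.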
In this article, you learned how to create an HDInsight Spark cluster and run a basic Spark SQL query. Advance to the next article to learn how to use an HDInsight Spark cluster to run interactive queries on sample data.