Create an Apache Spark cluster in Azure HDInsight

Learn how to create an Apache Spark cluster in Azure HDInsight, and how to run Spark SQL queries against Hive tables. For information on Spark on HDInsight, see Overview: Apache Spark on Azure HDInsight.

Prerequisites

An Azure subscription. Before you begin this tutorial, you must have an Azure subscription; you select it when deploying the template in the next section.

Create an HDInsight Spark cluster

Create an HDInsight Spark cluster using an Azure Resource Manager template. The template is available on GitHub. For other cluster creation methods, see Create HDInsight clusters.

  1. Click the following image to open the template in the Azure portal.

    Deploy to Azure

  2. Enter the following values:

    • Subscription: Select the Azure subscription to use for this cluster.
    • Resource group: Create a resource group or select an existing one. A resource group is used to manage Azure resources for your projects.
    • Location: Select a location for the resource group. The template uses this location for creating the cluster as well as for the default cluster storage.
    • ClusterName: Enter a name for the HDInsight cluster that you want to create.
    • Cluster login name and password: The default login name is admin.
    • SSH user name and password.
  3. Select I agree to the terms and conditions stated above, select Pin to dashboard, and then click Purchase. You see a new tile titled Submitting deployment for Template deployment. It takes about 20 minutes to create the cluster.
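
If you prefer scripting the deployment over the portal button, the same template can be deployed from code. The following is a minimal sketch using the Azure SDK for Python (azure-mgmt-resource); the resource group name, deployment name, template URI, and parameter names are hypothetical placeholders, not values taken from this article:

    # Sketch: deploy an ARM template with the Azure SDK for Python.
    # All names below (resource group, deployment name, template URI,
    # parameter names) are hypothetical placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient

    client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

    poller = client.deployments.begin_create_or_update(
        "myResourceGroup",         # an existing resource group
        "sparkClusterDeployment",  # a name for this deployment
        {
            "properties": {
                "mode": "Incremental",
                "templateLink": {"uri": "<template-uri>"},  # the GitHub template URL
                "parameters": {
                    "clusterName": {"value": "mysparkcluster"},
                    "clusterLoginPassword": {"value": "<cluster-password>"},
                    "sshPassword": {"value": "<ssh-password>"},
                },
            }
        },
    )
    poller.result()  # block until the deployment finishes (about 20 minutes)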

If you run into an issue with creating HDInsight clusters, it could be that you do not have the right permissions to do so. For more information, see Access control requirements.

Note

This article creates a Spark cluster that uses Azure Storage Blobs as the cluster storage. You can also create a Spark cluster that uses Azure Data Lake Store as the default storage. For instructions, see Create an HDInsight cluster with Data Lake Store.

Create a Jupyter notebook

Jupyter Notebook is an interactive notebook environment that supports various programming languages and lets you interact with your data, combine code with Markdown text, and perform simple visualizations. Spark on HDInsight also includes Zeppelin Notebook; this tutorial uses Jupyter Notebook.

To create a Jupyter notebook

  1. Open the Azure portal.

  2. Open the Spark cluster you created. For the instructions, see List and show clusters.

  3. From Quick links, click Cluster dashboards, and then click Jupyter Notebook. If prompted, enter the admin credentials for the cluster.

    Note

    You may also access the Jupyter Notebook for your cluster by opening the following URL in your browser. Replace CLUSTERNAME with the name of your cluster:

    https://CLUSTERNAME.azurehdinsight.net/jupyter

  4. Click New, and then click PySpark to create a notebook. Jupyter notebooks on HDInsight clusters support three kernels: PySpark, PySpark3, and Spark. This tutorial uses the PySpark kernel. For more information about the kernels, and the benefits of using PySpark, see Use Jupyter notebook kernels with Apache Spark clusters in HDInsight.

    A new notebook is created and opened with the name Untitled (Untitled.ipynb).

  5. Click the notebook name at the top, and enter a friendly name if you want.

Run Spark SQL statements on a Hive table

SQL (Structured Query Language) is the most common and widely used language for querying and defining data. Spark SQL functions as an extension to Apache Spark for processing structured data, using the familiar SQL syntax.

Spark SQL supports both SQL and HiveQL as query languages, and provides bindings in Python, Scala, and Java. With it, you can query data stored in many locations, such as external databases, structured data files (for example, JSON), and Hive tables.
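
As a sketch of the structured-data-file case, the following PySpark cell reads a JSON file and queries it with SQL through the preset sqlContext that HDInsight Jupyter notebooks provide (described later in this article). The wasb:/// file path is a hypothetical placeholder:

    # Sketch: query a JSON file with Spark SQL from the PySpark kernel.
    # The path below is a hypothetical placeholder for a JSON file in the
    # cluster's default storage account.
    df = sqlContext.read.json("wasb:///example/data/sample.json")
    df.registerTempTable("sample")  # expose the DataFrame to SQL (Spark 2.0+ prefers createOrReplaceTempView)
    sqlContext.sql("SELECT * FROM sample LIMIT 10").show()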

For an example of reading data from a CSV file instead of a Hive table, see Run interactive queries on Spark clusters in HDInsight.

To run Spark SQL

  1. From the notebook, paste the following code in an empty cell, and then press SHIFT + ENTER to run the code.

    %%sql
    SELECT * FROM hivesampletable LIMIT 10
    

    When you use a Jupyter Notebook with your HDInsight Spark cluster, you get a preset sqlContext that you can use to run Hive queries using Spark SQL. %%sql tells Jupyter Notebook to use the preset sqlContext to run the Hive query. The query retrieves the top 10 rows from a Hive table (hivesampletable) that comes with all HDInsight clusters by default. For more information on the %%sql magic and the preset contexts, see Jupyter kernels available for an HDInsight cluster. An equivalent plain-PySpark call is sketched after these steps.

    Every time you run a query in Jupyter, your web browser window title shows a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark text in the top-right corner. After the job is completed, it changes to a hollow circle.

    The screen refreshes to show the query output.

  2. From the File menu on the notebook, click Close and Halt. Shutting down the notebook releases the cluster resources.

  3. If you plan to complete the next steps at a later time, make sure you delete the HDInsight cluster you created in this article.
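
As referenced in step 1, the %%sql magic is a convenience over the preset sqlContext, and the same query can run as plain PySpark in a code cell. A minimal sketch (the output appears as plain text rather than the magic's rendered table):

    # Sketch: the same Hive query through the preset sqlContext
    # instead of the %%sql magic; rows is a Spark DataFrame.
    rows = sqlContext.sql("SELECT * FROM hivesampletable LIMIT 10")
    rows.show()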

Warning

Billing for HDInsight clusters is prorated per minute, whether you are using them or not. Be sure to delete your cluster after you have finished using it. For more information, see How to delete an HDInsight cluster.
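
Besides the portal, a cluster can also be deleted from code. A minimal sketch with the Azure SDK for Python, assuming the azure-mgmt-hdinsight package; the resource group and cluster names are hypothetical placeholders:

    # Sketch: delete an HDInsight cluster with the Azure SDK for Python.
    # Resource group and cluster names are hypothetical placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.hdinsight import HDInsightManagementClient

    client = HDInsightManagementClient(DefaultAzureCredential(), "<subscription-id>")
    client.clusters.begin_delete("myResourceGroup", "mysparkcluster").result()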

Next step

In this article, you learned how to create an HDInsight Spark cluster and run a basic Spark SQL query. Advance to the next article to learn how to use an HDInsight Spark cluster to run interactive queries on sample data.