How to use Hive metastore with Apache Spark™ cluster

Important

This feature is currently in preview. The Supplemental Terms of Use for Microsoft Azure Previews include additional legal terms that apply to Azure features that are in beta, in preview, or otherwise not yet released into general availability. For information about this specific preview, see Azure HDInsight on AKS preview information. For questions or feature suggestions, submit a request on AskHDInsight with the details, and follow us for more updates on the Azure HDInsight Community.

Sharing data and a metastore across multiple services is essential, and the Hive metastore is one of the most commonly used. HDInsight on AKS allows users to connect to an external metastore, which enables HDInsight users to seamlessly connect to other services in the ecosystem.

Azure HDInsight on AKS supports custom metastores, which are recommended for production clusters. The key steps involved are:

  1. Create Azure SQL database
  2. Create a key vault for storing the credentials
  3. Configure the metastore while you create an HDInsight on AKS cluster with Apache Spark™
  4. Operate on the external metastore (show databases and run a select with a limit of 1).

While you create the cluster, the HDInsight service needs to connect to the external metastore and verify your credentials.

Create Azure SQL database

  1. Create an Azure SQL Database, or use an existing one, before setting up a custom Hive metastore for an HDInsight cluster. A connectivity check sketch follows the note below.

    Note

    Currently, only Azure SQL Database is supported for the Hive metastore. Due to a Hive limitation, the "-" (hyphen) character is not supported in the metastore database name.
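
    Before wiring the database into the cluster, you can confirm that it's reachable with the admin credentials. The following is a minimal sketch, assuming the pyodbc package is installed; the server, database, and login names are placeholders, so substitute your own values.

    # Connectivity check for the metastore database (assumes: pip install pyodbc).
    # Server, database, and login names are placeholders.
    import pyodbc

    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=tcp:myserver.database.windows.net,1433;"
        "DATABASE=hivemetastoredb;"  # no "-" (hyphen) in the database name
        "UID=sqladmin;"
        "PWD=<your-admin-password>;"
        "Encrypt=yes;TrustServerCertificate=no;"
    )

    with pyodbc.connect(conn_str, timeout=30) as conn:
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        print("Connected:", cursor.fetchone()[0] == 1)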

Create a key vault for storing the credentials

  1. Create an Azure Key Vault.

    The Key Vault allows you to store the SQL Server admin password set during SQL database creation. The HDInsight on AKS platform doesn't handle the credential directly, so you need to store important credentials in Azure Key Vault. Learn the steps to create an Azure Key Vault.

  2. After creating the Azure Key Vault, assign the following roles:

    • User Assigned Managed Identity (the same UAMI as used by the HDInsight cluster): assign the Key Vault Secrets User role. Learn how to assign a role to a UAMI.
    • User (who creates the secret in Azure Key Vault): assign the Key Vault Administrator role. Learn how to assign a role to a user.

    Note

    Without this role, the user can't create a secret.

  3. Create a secret

    This step allows you to keep your SQL Server admin password as a secret in Azure Key Vault. Add your password (the same admin password you provided during SQL database creation) in the “Value” field while adding a secret. A scripted sketch of this step follows the note below.

    Screenshot showing how to create a key vault.

    Screenshot showing how to create a secret.

    Note

    Make sure to note the secret name, as you'll need this during cluster creation.
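
    If you prefer to script this step, here's a minimal sketch using the Azure SDK for Python (the azure-identity and azure-keyvault-secrets packages); the vault URL and secret name are placeholders.

    # Store the SQL admin password as a Key Vault secret.
    # Assumes: pip install azure-identity azure-keyvault-secrets.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    client = SecretClient(
        vault_url="https://my-metastore-kv.vault.azure.net",  # placeholder vault
        credential=DefaultAzureCredential(),
    )

    # Creating the secret requires the Key Vault Administrator role.
    client.set_secret("sql-admin-password", "<your-admin-password>")

    # Reading it back is what the cluster's UAMI does at runtime, which is
    # why the UAMI needs the Key Vault Secrets User role.
    print(client.get_secret("sql-admin-password").name)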

Configure the metastore while you create an HDInsight Spark cluster

  1. Navigate to your HDInsight on AKS cluster pool to create a new cluster.

    Screenshot showing how to create a new cluster.

  2. Enable the toggle to add an external Hive metastore and fill in the required details.

    Screenshot showing the basic tab.

  3. Fill in the rest of the details per the cluster creation rules for an Apache Spark cluster in HDInsight on AKS.

  4. Select Review and Create.

    Screenshot showing the review and create tab.

    Note

    • The lifecycle of the metastore isn't tied to a cluster's lifecycle, so you can create and delete clusters without losing metadata. Metadata such as your Hive schemas persists even after you delete and re-create the HDInsight cluster.
    • A custom metastore lets you attach multiple clusters and cluster types to that metastore. A quick verification check follows this note.
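
    Once the cluster is up, you can confirm from a notebook or the Spark shell that Spark is backed by a Hive catalog. This is a minimal check, assuming an active spark session:

    >> spark.conf.get("spark.sql.catalogImplementation")  # should return "hive"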

Operate on the external metastore

  1. Create a table

    >> spark.sql("CREATE TABLE sampleTable (number Int, word String)")

    Screenshot showing how to create table.
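
    To confirm the table was registered in the metastore-backed catalog (and not just a temporary view), you can list the catalog's tables; this assumes the same active spark session:

    >> spark.catalog.listTables()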

  2. Add data to the table

    >> spark.sql("INSERT INTO sampleTable VALUES (123, \"HDIonAKS\")");\

    Screenshot showing insert statement.

  3. Read the table

    >> spark.sql("select * from sampleTable").show()

    Screenshot showing how to read table.
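
  4. Verify the metastore

    To run the check called out in step 4 of the overview, list the databases tracked by the external metastore and read a single row back:

    >> spark.sql("SHOW DATABASES").show()
    >> spark.sql("SELECT * FROM sampleTable LIMIT 1").show()

    Because the metastore is external, these commands return the same results from any cluster attached to it, even after you delete and re-create the original cluster.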
