Connect to and manage Hive Metastore databases in Azure Purview

This article outlines how to register Hive Metastore databases, and how to authenticate and interact with Hive Metastore databases in Azure Purview. For more information about Azure Purview, read the introductory article.

Supported capabilities

Metadata extraction | Full scan | Incremental scan | Scoped scan | Classification | Access policy | Lineage
Yes | Yes | No | No | No | No | Yes**

** Lineage is supported if the dataset is used as a source or sink in the Azure Data Factory Copy activity.

Important

The supported platforms are Apache Hadoop, Cloudera, Hortonworks, and Azure Databricks.

Supported Hive versions are 2.x to 3.x. Supported Azure Databricks versions are 8.0 and later.

Prerequisites

Register

This section describes how to register a Hive Metastore database in Azure Purview by using Azure Purview Studio.

The only supported authentication for a Hive Metastore database is Basic Authentication.

  1. Go to your Azure Purview account.

  2. Select Data Map on the left pane.

  3. Select Register.

  4. In Register sources, select Hive Metastore > Continue.

  5. On the Register sources (Hive Metastore) screen, do the following:

    1. For Name, enter a name that Azure Purview will list as the data source.

    2. For Hive Cluster URL, enter a value that you get from the Ambari URL or the Azure Databricks workspace URL. For example, enter hive.azurehdinsight.net or adb-19255636414785.5.azuredatabricks.net.

    3. For Hive Metastore Server URL, enter a URL for the server. For example, enter sqlserver://hive.database.windows.net or jdbc:spark://adb-19255636414785.5.azuredatabricks.net:443.

    4. For Select a collection, choose a collection from the list or create a new one. This step is optional.

    Screenshot that shows boxes for registering Hive sources.

  6. Select Finish.

Scan

Use the following steps to scan Hive Metastore databases to automatically identify assets and classify your data. For more information about scanning in general, see Scans and ingestion in Azure Purview.

  1. In the Management Center, select Integration runtimes. Make sure that a self-hosted integration runtime is set up. If it isn't, follow the steps in Create and manage a self-hosted integration runtime.

  2. Go to Sources.

  3. Select the registered Hive Metastore database.

  4. Select + New scan.

  5. Provide the following details:

    1. Name: Enter a name for the scan.

    2. Connect via integration runtime: Select the configured self-hosted integration runtime.

    3. Credential: Select the credential to connect to your data source. Make sure to:

      • Select Basic Authentication while creating a credential.
      • Provide the Metastore username in the appropriate box.
      • Store the Metastore password in the secret key.

      For more information, see Credentials for source authentication in Azure Purview.

      Azure Databricks usage: Go to your Azure Databricks cluster, select Apps, and then select Launch Web Terminal. Run the command cat /databricks/hive/conf/hive-site.xml.

      You can also access the username and password from the following two properties:

      Screenshot that shows Azure Databricks username and password examples as property values.
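As a sketch of how those property values can be pulled out of hive-site.xml programmatically, the following Python snippet parses the file and looks up the standard Hive Metastore connection properties. The helper name read_hive_props is hypothetical; the property names javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword are the standard Hive configuration keys for the Metastore username and password.

```python
import xml.etree.ElementTree as ET

def read_hive_props(source):
    """Return a dict of property name -> value from a hive-site.xml file.

    `source` can be a file path or an open file-like object.
    (Illustrative helper; not part of Azure Purview.)
    """
    props = {}
    for prop in ET.parse(source).getroot().iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name] = prop.findtext("value")
    return props

# Standard Hive Metastore connection properties:
# props["javax.jdo.option.ConnectionUserName"]  -> Metastore username
# props["javax.jdo.option.ConnectionPassword"]  -> Metastore password
```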

    4. Metastore JDBC Driver Location: Specify the path to the JDBC driver folder on the machine where the self-hosted integration runtime is running. The path must point to a valid folder that contains the driver JAR files.

      If you're scanning Azure Databricks, refer to the information on Azure Databricks in the next step.

      Note

      The driver should be accessible to all accounts in the machine. Don't install it in a user account.

    5. Metastore JDBC Driver Class: Provide the class name for the connection driver. For example, enter com.microsoft.sqlserver.jdbc.SQLServerDriver.

      Azure Databricks usage: Go to your Azure Databricks cluster, select Apps, and then select Launch Web Terminal. Run the command cat /databricks/hive/conf/hive-site.xml.

      You can access the driver class from the following property:

      Screenshot that shows a driver class as a property value.

    6. Metastore JDBC URL: Provide the connection URL of the Metastore database server. For example: jdbc:sqlserver://hive.database.windows.net;database=hive;encrypt=true;trustServerCertificate=true;create=false;loginTimeout=300.

      Azure Databricks usage: Go to your Azure Databricks cluster, select Apps, and then select Launch Web Terminal. Run the command cat /databricks/hive/conf/hive-site.xml.

      You can access the JDBC URL from the connection URL property, as shown in the following screenshot:

      Screenshot that shows an example connection URL property.

      Note

      When you copy the URL from hive-site.xml, replace each &amp; escape with a plain & character, or the scan will fail. Then append the path to your SSL certificate's location on the machine to the URL. Download the SSL certificate.

      When you enter local file system paths in the Purview Studio scan configuration, remember to change the Windows path separator character from a backslash (\) to a forward slash (/). For example, if your MariaDB JAR file is C:\mariadb-jdbc.jar, change it to C:/mariadb-jdbc.jar. Make the same change to the Metastore JDBC URL sslCA parameter. For example, if it's placed at local file system path D:\Drivers\SSLCert\BaltimoreCyberTrustRoot.crt.pem, change it to D:/Drivers/SSLCert/BaltimoreCyberTrustRoot.crt.pem.

      The Metastore JDBC URL value will look like this example:

      jdbc:mariadb://consolidated-westus2-prod-metastore-addl-1.mysql.database.azure.com:3306/organization1829255636414785?trustServerCertificate=true&useSSL=true&sslCA=D:/Drivers/SSLCert/BaltimoreCyberTrustRoot.crt.pem
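The two fixes described in the note above (unescaping &amp; and converting the sslCA path separators) can be sketched in Python as follows. The helper name normalize_metastore_url is hypothetical, for illustration only; it assumes the URL was copied verbatim from hive-site.xml and that the certificate path is a local Windows path.

```python
def normalize_metastore_url(raw_url, ssl_cert_path=None):
    """Clean a JDBC URL copied from hive-site.xml for the Purview scan form.

    - XML-escaped ampersands (&amp;) must become plain &, or the scan fails.
    - The sslCA path must use forward slashes, even on Windows.
    (Illustrative helper; not part of Azure Purview.)
    """
    url = raw_url.replace("&amp;", "&")
    if ssl_cert_path:
        url += "&sslCA=" + ssl_cert_path.replace("\\", "/")
    return url
```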

    7. Metastore database name: Provide the name of the Hive Metastore database.

      Azure Databricks usage: Go to your Azure Databricks cluster, select Apps, and then select Launch Web Terminal. Run the command cat /databricks/hive/conf/hive-site.xml.

      You can access the database name from the JDBC URL property, as shown in the following screenshot.

      Screenshot that shows an example database name as a JDBC property.
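To illustrate where the database name sits inside a MariaDB/MySQL-style JDBC URL (as used by Azure Databricks metastores), the following sketch extracts it with Python's standard urlparse. The helper name database_from_jdbc_url is hypothetical; SQL Server-style URLs (which carry the database in a ;database= segment instead) are not handled here.

```python
from urllib.parse import urlparse

def database_from_jdbc_url(jdbc_url):
    """Extract the database name from a MariaDB/MySQL-style JDBC URL.

    The database name is the path segment after host:port and before
    any ?key=value parameters. (Illustrative helper only.)
    """
    # Strip the "jdbc:" prefix so urlparse sees a normal scheme://host/path URL.
    parsed = urlparse(jdbc_url[len("jdbc:"):])
    return parsed.path.lstrip("/")
```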

    8. Schema: Specify a list of Hive schemas to import. For example: schema1; schema2.

      If the list is empty, all user schemas are imported. All system schemas (for example, SysAdmin) and objects are ignored by default.

      Schema name patterns use SQL LIKE expression syntax and can include the percent sign (%). For example, A%; %B; %C%; D means:

      • Start with A or
      • End with B or
      • Contain C or
      • Equal D

      The NOT keyword and special characters aren't acceptable.
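The pattern semantics above can be sketched as follows; the helper name schema_matches is hypothetical and shown only to make the LIKE-style matching rules concrete.

```python
import re

def schema_matches(schema_name, pattern_list):
    """Return True if schema_name matches the semicolon-separated pattern list.

    Patterns use SQL LIKE syntax: % matches any run of characters, everything
    else is literal. An empty list matches every schema (all user schemas are
    imported). (Illustrative helper; not part of Azure Purview.)
    """
    patterns = [p.strip() for p in pattern_list.split(";") if p.strip()]
    if not patterns:
        return True  # empty list: import all user schemas
    for p in patterns:
        # Translate LIKE to a regex: % -> .*, literal characters escaped.
        regex = "".join(".*" if ch == "%" else re.escape(ch) for ch in p)
        if re.fullmatch(regex, schema_name):
            return True
    return False
```

For example, with the list A%; %B; %C%; D, the schema Apple matches (starts with A), labB matches (ends with B), xCy matches (contains C), and D matches exactly, while a name like zzz does not.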

    9. Maximum memory available: Maximum memory (in gigabytes) available on the customer's machine for the scanning processes to use. This value depends on the size of the Hive Metastore database to be scanned.

    Screenshot that shows boxes for scan details.

  6. Select Continue.

  7. For Scan trigger, choose whether to set up a schedule or run the scan once.

  8. Review your scan and select Save and Run.

Next steps

Now that you've registered your source, use the following guides to learn more about Azure Purview and your data: