Best practices: Data governance on Azure Databricks

This document describes the need for data governance and shares best practices and strategies you can use to implement these techniques across your organization. It demonstrates a typical deployment workflow you can employ using Azure Databricks and cloud-native solutions to secure and monitor each layer from the application down to storage.

Why is data governance important?

Data governance is an umbrella term that encapsulates the policies and practices implemented to securely manage the data assets within an organization. As one of the key tenets of any successful data governance practice, data security is likely to be top of mind at any large organization. Key to data security is the ability for data teams to have superior visibility and auditability of user data access patterns across their organization. Implementing an effective data governance solution helps companies protect their data from unauthorized access and ensures that they have rules in place to comply with regulatory requirements.

Governance challenges

Whether you manage the data of a startup or a large corporation, security teams and platform owners face the challenge of ensuring that this data is secure and managed according to the internal controls of the organization. Regulatory bodies the world over are changing the way we think about how data is captured and stored, and these compliance risks only add further complexity to an already tough problem. How, then, do you open your data to those who can drive the use cases of the future? Ultimately, you should adopt data policies and practices that help the business realize value through the meaningful application of what are often vast, ever-growing stores of data. Data teams deliver solutions to the world’s toughest problems when they have access to many disparate sources of data.

Typical challenges when considering the security and availability of your data in the cloud:

  • Do your current data and analytics tools support access controls on your data in the cloud? Do they provide robust logging of actions taken on the data as it moves through the given tool?
  • Will the security and monitoring solution you put in place now scale as demand on the data in your data lake grows? It can be easy enough to provision and monitor data access for a small number of users. What happens when you want to open up your data lake to hundreds of users? To thousands?
  • Is there anything you can do to be proactive in ensuring that your data access policies are being observed? It is not enough to simply monitor, because that only generates more data; you should also have a solution in place that actively monitors and tracks access to this information across the organization.
  • What steps can you take to identify gaps in your existing data governance solution?

How Azure Databricks addresses these challenges

  • Access control: A rich suite of access controls that extend all the way down to the storage layer. Azure Databricks takes advantage of its cloud backbone by integrating Azure security services directly into the platform. Enable Azure Active Directory credential passthrough on your Spark clusters to control access to your data lake.
  • Cluster policies: Enable administrators to control access to compute resources.
  • API first: Automate provisioning and permission management with the Databricks REST API.
  • Audit logging: Robust audit logs of the actions and operations taken across the workspace, delivered to your data lake. Azure Databricks leverages Azure to provide data access information across your deployment account and any others you configure. You can then use this information to power alerts that tip you off to potential wrongdoing.

The following sections illustrate how to use these Azure Databricks features to implement a governance solution.

Set up access control

To set up access control, you secure access to storage and implement fine-grained control of individual tables.

Implement table access control

You can enable table access control on Azure Databricks to programmatically grant, deny, and revoke access to your data from the Spark SQL API. You can control access to securable objects like databases, tables, views, and functions. Consider a scenario where your company has a database to store financial data. You might want your analysts to create financial reports using that data. However, another table in the database might contain sensitive information that analysts should not access. You can grant a user or group the privileges required to read data from one table while denying all privileges to access the second table.

In the following illustration, Alice is an admin who owns the shared_data and private_data tables in the Finance database. Alice then provides Oscar, an analyst, with the privileges required to read from shared_data but denies all privileges to private_data.

Grant select

Alice grants SELECT privileges to Oscar to read from shared_data:

Grant select table

Alice denies all privileges to Oscar to access private_data:

Deny statement
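
A minimal sketch of what these two statements can look like when run from a Python notebook on a cluster with table access control enabled; the finance database name and Oscar's sign-in name are placeholders:

# Allow Oscar to read shared_data; block all access to private_data.
spark.sql("GRANT SELECT ON TABLE finance.shared_data TO `oscar@example.com`")
spark.sql("DENY ALL PRIVILEGES ON TABLE finance.private_data TO `oscar@example.com`")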

You can take this one step further by defining fine-grained access controls to a subset of a table or by setting privileges on derived views of a table.

Deny table
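
For example, a derived view can expose only an aggregated subset of shared_data, with SELECT granted on the view alone. This sketch assumes hypothetical region and amount columns:

# Define a view over the subset of data the analyst is allowed to see.
spark.sql("""
  CREATE VIEW finance.shared_data_by_region AS
  SELECT region, SUM(amount) AS total_amount
  FROM finance.shared_data
  GROUP BY region
""")

# Grant read access on the view only; the underlying table stays restricted.
spark.sql("GRANT SELECT ON VIEW finance.shared_data_by_region TO `oscar@example.com`")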

Secure access to Azure Data Lake Storage

You can access data in Azure Data Lake Storage from Azure Databricks clusters in a couple of ways. The right method depends mainly on how the data will be used in the workflow at hand. Will you be accessing your data in an interactive, ad-hoc way, perhaps developing an ML model or building an operational dashboard? In that case, we recommend that you use Azure Active Directory (Azure AD) credential passthrough. Will you be running automated, scheduled workloads that require one-off access to the containers in your data lake? Then using service principals to access Azure Data Lake Storage is preferred.

Credential passthrough

Credential passthrough provides user-scoped data access controls to any provisioned file stores based on the user’s role-based access controls. When you configure a cluster, select and expand Advanced Options to enable credential passthrough. Any users who attempt to access data on the cluster are governed by the access controls put in place on their corresponding file system resources, according to their Azure Active Directory account.

Cluster permission

This solution is suitable for many interactive use cases and offers a streamlined approach, requiring that you manage permissions in just one place. In this way, you can allocate one cluster to multiple users without having to worry about provisioning specific access controls for each of your users. Process isolation on Azure Databricks clusters ensures that user credentials will not be leaked or otherwise shared. This approach also has the added benefit of logging user-level entries in your Azure storage audit logs, which can help platform admins to associate storage layer actions with specific users.
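
For example, once credential passthrough is enabled, reading from the data lake is just a standard read against the storage URI; the request is authorized with the calling user’s Azure AD identity. In this sketch the storage account, container, and path are placeholders:

# Runs as the signed-in user; access is evaluated against their Azure AD
# permissions on the underlying storage account.
df = spark.read.format("parquet").load(
    "abfss://finance@examplestorage.dfs.core.windows.net/shared_data/")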

Some limitations to this method are:

  • Supports only Azure Data Lake Storage file systems.
  • Does not support Databricks REST API access.
  • Table access control: Azure Databricks does not recommend using credential passthrough with table access control. For more details on the limitations of combining these two features, see Limitations. For more information about using table access control, see Implement table access control.
  • Not suitable for long-running jobs or queries, because of the limited time-to-live on a user’s access token. For these types of workloads, we recommend that you use service principals to access your data.

Securely mount Azure Data Lake Storage using credential passthrough

You can mount an Azure Data Lake Storage account, or a folder inside it, to the Databricks File System (DBFS), providing an easy and secure way to access data in your data lake. The mount is a pointer to a data lake store, so the data is never synced locally. When you mount data using a cluster enabled with Azure Data Lake Storage credential passthrough, any read or write to the mount point uses your Azure AD credentials. This mount point is visible to other users, but the only users who have read and write access are those who:

  • Have access to the underlying Azure Data Lake Storage account
  • Are using a cluster enabled for Azure Data Lake Storage credential passthrough

To mount Azure Data Lake Storage using credential passthrough, follow the instructions in Mount Azure Data Lake Storage to DBFS using credential passthrough.
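
A minimal sketch of such a mount for an Azure Data Lake Storage Gen2 container follows; the container, storage account, and mount point names are placeholders, and the cluster must have credential passthrough enabled:

# Configure the mount to obtain tokens through credential passthrough
# rather than a stored credential.
configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class":
    spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

# Reads and writes under /mnt/finance are authorized with the calling user's
# Azure AD identity.
dbutils.fs.mount(
  source = "abfss://finance@examplestorage.dfs.core.windows.net/",
  mount_point = "/mnt/finance",
  extra_configs = configs)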

Service principals

How do you grant access to users or service accounts for longer-running or more frequent workloads? What if you want to use a business intelligence tool, such as Power BI or Tableau, that needs access to the tables in Azure Databricks via ODBC/JDBC? In these cases, you should use service principals and OAuth. Service principals are identities scoped to specific Azure resources. When building a job in a notebook, you can add the following lines to the job cluster’s Spark configuration or run them directly in the notebook. This gives you access to the corresponding file store within the scope of the job.

spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

Similarly, you can mount your file store(s) to DBFS using a service principal and OAuth 2.0, or read directly from an Azure Data Lake Storage Gen1 or Gen2 URI. Once you’ve set the configuration above, you can access files directly in your Azure Data Lake Storage using the URI:

"abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>"

All users on a cluster with a file system registered in this way will have access to the data in the file system.

Manage cluster configurations

Cluster policies allow Azure Databricks administrators to define the cluster attributes that are allowed on a cluster, such as instance types, number of nodes, custom tags, and more. When an admin creates a policy and assigns it to a user or a group, those users can create clusters only according to the policies they have access to. This gives administrators a much higher degree of control over the types of clusters that can be created.

You define policies in a JSON policy definition and then create cluster policies using the cluster policies UI or Cluster Policies API. A user can create a cluster only if they have the create_cluster permission or access to at least one cluster policy. Extending the requirements of the new analytics project team described in the following section, administrators can create a cluster policy and assign it to one or more users within the project team, who can then create clusters for the team that are limited to the rules specified in the cluster policy. The image below shows an example of a user who has access to the Project Team Cluster Policy creating a cluster based on that policy definition.

Cluster policy
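
To illustrate, a policy definition along these lines could pin the passthrough Spark configuration, restrict instance types to memory optimized options, cap cluster size and auto-termination, and enforce a chargeback tag. This is a sketch; the specific attribute values below are assumptions, not a prescribed policy:

{
  "spark_conf.spark.databricks.passthrough.enabled": {
    "type": "fixed",
    "value": "true"
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_D14_v2", "Standard_DS13_v2"]
  },
  "autoscale.max_workers": {
    "type": "range",
    "maxValue": 20
  },
  "autotermination_minutes": {
    "type": "range",
    "maxValue": 120,
    "defaultValue": 60
  },
  "custom_tags.team": {
    "type": "fixed",
    "value": "new-project-team"
  }
}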

Automatically provision clusters and grant permissions

With the addition of endpoints for both clusters and permissions, the Databricks REST API 2.0 makes it easy to provision cluster resources and grant permissions on them for users and groups at any scale. You can use the Clusters API to create and configure clusters for your specific use case.

You can then use the Permissions API to apply access controls to the cluster.

Important

The Permissions API is in Private Preview. To enable this feature for your Azure Databricks workspace, contact your Azure Databricks representative.

The following is an example of a configuration that might suit a new analytics project team.

The requirements are:

  • Support the interactive workloads of this team, who are mostly SQL and Python users.
  • Provision a data source in object storage with credentials that give the team access to the data tied to the role.
  • Ensure that users get an equal share of the cluster’s resources.
  • Provision larger, memory optimized instance types.
  • Grant permissions to the cluster such that only this new project team has access to it.
  • Tag this cluster to make sure you can properly do chargebacks on any compute costs incurred.

Deployment script

You deploy this configuration by using the API endpoints in the Clusters and Permissions APIs.

Provision cluster

Endpoint - https://<databricks-instance>/api/2.0/clusters/create


{
  "autoscale": {
      "min_workers": 2,
      "max_workers": 20
  },
  "cluster_name": "project team interactive cluster",
  "spark_version": "latest-stable-scala2.11",
  "spark_conf": {
      "spark.Azure Databricks.cluster.profile": "serverless",
      "spark.Azure Databricks.repl.allowedLanguages": "python,sql",
      "spark.Azure Databricks.passthrough.enabled": "true",
      "spark.Azure Databricks.pyspark.enableProcessIsolation": "true"
  },
  "node_type_id": "Standard_D14_v2",
  "ssh_public_keys": [],
  "custom_tags": {
      "ResourceClass": "Serverless",
      "team": "new-project-team"
  },
  "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 60,
  "enable_elastic_disk": true,
  "init_scripts": []
}

Grant cluster permission

Endpoint - https://<databricks-instance>/api/2.0/permissions/clusters/<cluster_id>

{
  "access_control_list": [
    {
      "group_name": "project team",
      "permission_level": "CAN_MANAGE"
    }
  ]
}
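
Tying the two calls together, a deployment script might look like the following sketch. It assumes the workspace URL placeholder above, a personal access token stored in a DATABRICKS_TOKEN environment variable, and an abbreviated version of the cluster specification shown earlier; the sketch uses PATCH so the entry is added to the cluster's existing access control list.

import os
import requests

host = "https://<databricks-instance>"  # workspace URL placeholder
headers = {"Authorization": "Bearer " + os.environ["DATABRICKS_TOKEN"]}

# Create the cluster. The full specification from the Clusters API example
# above (including the passthrough spark_conf) can be used here.
cluster_spec = {
    "cluster_name": "project team interactive cluster",
    "spark_version": "latest-stable-scala2.11",
    "node_type_id": "Standard_D14_v2",
    "autoscale": {"min_workers": 2, "max_workers": 20},
    "custom_tags": {"ResourceClass": "Serverless", "team": "new-project-team"},
    "autotermination_minutes": 60
}
cluster_id = requests.post(
    host + "/api/2.0/clusters/create", headers=headers, json=cluster_spec
).json()["cluster_id"]

# Grant the project team CAN_MANAGE on the new cluster.
acl = {
    "access_control_list": [
        {"group_name": "project team", "permission_level": "CAN_MANAGE"}
    ]
}
requests.patch(
    host + "/api/2.0/permissions/clusters/" + cluster_id, headers=headers, json=acl
)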

Instantly you have a cluster that has been provisioned with secure access to critical data in the lake, locked down to all but the corresponding team, tagged for chargebacks, and configured to meet the requirements of the project. Implementing this solution requires additional configuration steps within your host cloud provider account, though these, too, can be automated to meet the requirements of scale.

Audit access

Configuring access control in Azure Databricks and controlling data access in the storage account is a great first step towards an efficient data governance solution. However, a complete solution also requires auditing access to data and providing alerting and monitoring capabilities. Azure Databricks provides a comprehensive set of audit events to log activities performed by users, allowing enterprises to monitor detailed usage patterns on the platform. To get a complete understanding of what users are doing on the platform and what data is being accessed, you should use both the native Azure Databricks and cloud provider audit logging capabilities.

Make sure you have diagnostic logging enabled in Azure Databricks. Once logging is enabled for your account, Azure Databricks automatically starts sending diagnostic logs to the delivery location you specified. You also have the option to Send to Log Analytics, which will forward diagnostic data to Azure Monitor. Here is an example query you can enter into the Log search box to query all users who have logged into the Azure Databricks workspace and their location:

Azure monitor
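
As a rough sketch of such a query, assuming the diagnostic logs are routed to the DatabricksAccounts table in Log Analytics (table and column names may differ in your configuration):

// Assumes the DatabricksAccounts table produced by the diagnostic setting;
// adjust names to match your workspace.
DatabricksAccounts
| where ActionName contains "login"
| extend signIn = parse_json(Identity)
| project TimeGenerated, UserEmail = tostring(signIn.email), SourceIPAddress, ActionName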

In a few steps, you can use Azure monitoring services to create real-time alerts. The Azure Activity Log provides visibility into the actions taken on your storage accounts and the containers within them, and alert rules can be configured there as well.

Azure activity log

Learn more

Here are some resources to help you build a comprehensive data governance solution that meets your organization’s needs: