Encryption at rest concepts and configuration guide

Starting from SQL Server Big Data Clusters CU8, a comprehensive encryption at rest feature set is available to provide application level encryption to all data stored in the platform. This guide documents the concepts, architecture, and configuration for the encryption at rest feature set for SQL Server Big Data Clusters.

SQL Server Big Data Clusters stores data in the following two locations:

  • SQL Server master instance
  • HDFS, used by the storage pool and Spark

To transparently encrypt data in SQL Server Big Data Clusters, two approaches are possible:

  • Volume encryption. This approach is supported by the Kubernetes platform and is expected as a best practice for Big Data Clusters deployments. This guide doesn't cover volume encryption. Consult your Kubernetes platform or appliance documentation for guidance on how to properly encrypt volumes that will be used for SQL Server Big Data Clusters.
  • Application-level encryption. This architecture refers to the encryption of data by the application handling the data, before it is written to disk. If the volumes are exposed, an attacker can't restore data artifacts elsewhere unless the destination system is also configured with the same encryption keys.

The Encryption at Rest feature set of SQL Server Big Data Clusters supports the core scenario of application level encryption for the SQL Server and HDFS components.

The following capabilities are provided:

  • System-managed encryption at rest. This capability is available in CU8.
  • User-managed encryption at rest, also known as bring your own key (BYOK), with both service-managed and external key provider integrations. Currently, only service-managed, user-created keys are supported.

Key Definitions

SQL Server Big Data Clusters key management service (KMS)

A controller-hosted service responsible for managing keys and certificates for the Encryption at Rest feature set of the SQL Server BDC cluster. The service supports the following features:

  • Secure management and storage of keys and certificates used for encryption at rest.
  • Hadoop KMS compatibility. The service acts as the key management service for the HDFS component on BDC.
  • SQL Server TDE certificate management.

Throughout the rest of this document, this service is referred to as BDC KMS, and the term BDC refers to the SQL Server Big Data Clusters platform.

System-managed keys and certificates

The BDC KMS service will manage all keys and certificates for SQL Server and HDFS.

User provided certificates

User provided keys and certificates to be managed by BDC KMS, commonly known as bring your own key (BYOK).

External providers

External key solutions compatible with BDC KMS for external delegation. This capability isn't supported at this time.

Encryption at rest on SQL Server Big Data Clusters

Read this document carefully to completely assess your scenario.

The feature set introduces the BDC KMS controller service to provide system-managed keys and certificates for data encryption at rest on both SQL Server and HDFS. Those keys and certificates are service-managed and this documentation provides operational guidance on how to interact with the service.

  • SQL Server instances leverage the established Transparent Data Encryption (TDE) functionality.
  • HDFS uses native Hadoop KMS within each pod to interact with BDC KMS on the controller. This enables HDFS encryption zones, which provide secure paths on HDFS.

SQL Server instances

  • A system-generated certificate is installed on SQL Server pods for use with TDE commands. The system-managed certificate naming standard is TDECertificate + timestamp. For example, TDECertificate2020_09_15_22_46_27.
  • Databases provisioned by BDC on the master instance and user databases aren't encrypted automatically. DBAs may use the installed certificate to encrypt any database; see the sketch after this list.
  • The compute pool and storage pool are automatically encrypted using the system-generated certificate.
  • Data pool encryption, although technically possible using T-SQL EXECUTE AT commands, is discouraged and unsupported at this time. Encryption applied with this technique might not be effective, and it creates an incompatible upgrade path toward future releases.
  • SQL Server key rotation is achieved using standard T-SQL administrative commands. See the SQL Server Big Data Clusters transparent data encryption (TDE) at rest usage guide for complete instructions.
  • Encryption is monitored through the existing standard SQL Server DMVs for TDE.
  • Backing up and restoring a TDE-enabled database into the cluster is supported.
  • HA is supported. If a database on the primary SQL Server instance is encrypted, all secondary replicas of the database are encrypted as well.
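
The following is a minimal sketch of encrypting a master instance database with the system-managed certificate and then checking the standard TDE DMV. It assumes a hypothetical database named MyUserDb, the example certificate name from this list, and sqlcmd connectivity to the master instance endpoint; all of these values vary per cluster.

    #!/bin/bash
    # Hedged sketch: encrypt a master instance database with the
    # system-managed TDE certificate, then check the TDE DMV.
    # MyUserDb, the endpoint, and the password variable are placeholders.

    MASTER_ENDPOINT="<master-instance-endpoint>,31433"

    sqlcmd -S "$MASTER_ENDPOINT" -U admin -P "$SQL_ADMIN_PASSWORD" -Q "
    USE [MyUserDb];
    CREATE DATABASE ENCRYPTION KEY
        WITH ALGORITHM = AES_256
        ENCRYPTION BY SERVER CERTIFICATE [TDECertificate2020_09_15_22_46_27];
    ALTER DATABASE [MyUserDb] SET ENCRYPTION ON;
    -- encryption_state = 3 means the database is encrypted.
    SELECT DB_NAME(database_id) AS database_name, encryption_state
    FROM sys.dm_database_encryption_keys;
    "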

HDFS encryption zones

  • Active Directory integration is required to enable the encryption zones feature for HDFS.
  • A system-generated key is provisioned in Hadoop KMS. The key name is securelakekey. As of CU8, the default key is 256-bit, and 256-bit AES encryption is supported.
  • A default encryption zone is provisioned on a path named /securelake using the above system-generated key.
  • Users can create additional keys and encryption zones using the specific instructions provided in this guide, choosing a key length of 128, 192, or 256 bits at key creation; see the sketch after this list.
  • HDFS encryption zone key rotation is achieved using azdata. See the SQL Server Big Data Clusters HDFS Encryption Zones usage guide for complete instructions.
  • Mounting HDFS tiering on top of an encryption zone is not supported.
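
The azdata syntax for these operations is covered in the usage guide referenced above. As an illustration of the underlying Hadoop operations only, the following sketch creates a hypothetical key and encryption zone using the stock Hadoop CLI from inside the name node pod; an Active Directory login may be required first, and mykey and /securelake2 are hypothetical names.

    #!/bin/bash
    # Illustrative only: mykey and /securelake2 are hypothetical names, and
    # BDC_CLUSTER_NS must hold your cluster's Kubernetes namespace.

    kubectl exec -n $BDC_CLUSTER_NS -c hadoop nmnode-0-0 -- bash -c '
      hadoop key create mykey -size 256    # key length: 128, 192, or 256 bits
      hdfs dfs -mkdir /securelake2         # a zone root must be an empty directory
      hdfs crypto -createZone -keyName mykey -path /securelake2
      hdfs crypto -listZones               # verify the new zone
    '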

Encryption at Rest Administration

The following list summarizes the administration capabilities for Encryption at Rest:

  • SQL Server TDE management is performed using standard T-SQL commands.
  • HDFS Encryption Zones and HDFS key management is performed using azdata commands.
  • The following administration features are performed using Operational Notebooks:
    • HDFS key backup and recovery
    • HDFS key deletion

Configuration guide

In new deployments of SQL Server Big Data Clusters, CU8 onwards, encryption at rest is enabled and configured by default. This means:

  • The BDC KMS component is deployed in the controller and generates a default set of keys and certificates.
  • SQL Server is deployed with TDE turned on, and certificates are installed by the controller.
  • Hadoop KMS (for HDFS) is configured to interact with BDC KMS for encryption operations. HDFS encryption zones are configured and ready to use.

Requirements and default behaviors described in the previous section apply.

If you are performing a new deployment of SQL Server BDC CU8 onwards, or upgrading directly to CU9, no additional steps are required.
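
As a quick post-deployment check, the default encryption zone and key described above can be verified from the name node pod. This is a hedged sketch that assumes kubectl access and the BDC_CLUSTER_NS variable set to the cluster namespace:

    #!/bin/bash
    # Expect /securelake to be listed, backed by the securelakekey key.
    kubectl exec -n $BDC_CLUSTER_NS -c hadoop nmnode-0-0 -- bash -c 'hdfs crypto -listZones'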

Upgrade scenarios

On existing clusters, the upgrade process won't enforce new encryption or re-encryption on user data that was not already encrypted. This behavior is by design and the following needs to be considered per component:

  • SQL Server

    1. SQL Server master instance. The upgrade process doesn't affect master instance databases or installed TDE certificates, but we highly encourage backing up your databases and your manually installed TDE certificates before the upgrade, and storing those artifacts outside the SQL Server BDC cluster.
    2. Compute and storage pool. These databases are system-managed and volatile; they are re-created and automatically encrypted on cluster upgrade.
    3. Data pool. The upgrade doesn't affect databases in the SQL Server instances that are part of the data pool.
  • HDFS

    1. HDFS. The upgrade process won't touch HDFS files and folders outside encryption zones.

Upgrading to CU9 from CU8 or earlier

No additional steps are required.

Upgrading to CU8 from CU6 or earlier

Caution

Before upgrading to SQL Server Big Data Clusters CU8, perform a complete backup of your data.

When you upgrade to CU8, encryption zones aren't configured: the Hadoop KMS component won't be configured to use BDC KMS. To configure and enable the HDFS encryption zones feature after the upgrade, follow the instructions in the next section.

Enable HDFS encryption zones after upgrade to CU8

If you upgraded your cluster to CU8 (azdata upgrade) and want to enable HDFS encryption zones, there are two options:

  • Execution of the Azure Data Studio Operational Notebook named SOP0128 - Enable HDFS Encryption zones in Big Data Clusters to perform the configuration.
  • Script execution, as described below.

Requirements:

  • Active Directory integrated cluster.

  • Azure Data CLI (azdata) configured and logged into the cluster in AD mode.

Use the following procedure to reconfigure the cluster with encryption zones support.

  1. After performing the upgrade with azdata, save the following scripts.

    Script execution requirements (a quick sanity check is sketched after this list):

    • Both scripts must be located in the same directory.
    • kubectl on the PATH.
    • A Kubernetes config file in the folder $HOME/.kube.
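
    A minimal sanity check for these requirements might look like the following sketch (the file names are the ones defined in this procedure):

    # Verify the prerequisites before running the patch scripts.
    command -v kubectl >/dev/null || echo "kubectl is not on the PATH"
    [[ -f "$HOME/.kube/config" ]] || echo "no Kubernetes config file in $HOME/.kube"
    [[ -f ./run-key-provider-patch.sh && -f ./updatekeyprovider.py ]] \
        || echo "both scripts must be in the current directory"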

    This script should be named run-key-provider-patch.sh:

    #!/bin/bash
    
    if [[ -z "${BDC_DOMAIN}" ]]; then
      echo "BDC_DOMAIN environment variable with the domain name of the cluster is not defined."
      exit 1
    fi
    
    if [[ -z "${BDC_CLUSTER_NS}" ]]; then
      echo "BDC_CLUSTER_NS environment variable with the cluster namespace is not defined."
      exit 1
    fi
    
    # List the config maps in the cluster namespace to verify connectivity.
    kubectl get configmaps -n $BDC_CLUSTER_NS
    
    # Preview the changes updatekeyprovider.py will apply to each config map.
    #
    diff <(kubectl get configmaps mssql-hadoop-storage-0-configmap -n $BDC_CLUSTER_NS -o json | ./updatekeyprovider.py) <(kubectl get configmaps mssql-hadoop-storage-0-configmap -n $BDC_CLUSTER_NS -o json | python -m json.tool)
    
    diff <(kubectl get configmaps mssql-hadoop-sparkhead-configmap -n $BDC_CLUSTER_NS -o json | ./updatekeyprovider.py) <(kubectl get configmaps mssql-hadoop-sparkhead-configmap -n $BDC_CLUSTER_NS -o json | python -m json.tool)
    
    # Replace the config maps.
    #
    kubectl replace -n $BDC_CLUSTER_NS -o json -f <(kubectl get configmaps mssql-hadoop-storage-0-configmap -n $BDC_CLUSTER_NS -o json | ./updatekeyprovider.py)
    
    kubectl replace -n $BDC_CLUSTER_NS -o json -f <(kubectl get configmaps mssql-hadoop-sparkhead-configmap -n $BDC_CLUSTER_NS -o json | ./updatekeyprovider.py)
    
    # Restart the pods which need to have the necessary changes with the core-site.xml
    kubectl delete pods -n $BDC_CLUSTER_NS nmnode-0-0
    kubectl delete pods -n $BDC_CLUSTER_NS storage-0-0
    kubectl delete pods -n $BDC_CLUSTER_NS storage-0-1
    
    # Wait for some time for the pods to restart
    #
    sleep 300
    
    # Check for the KMS process status.
    #
    kubectl exec -n $BDC_CLUSTER_NS -c hadoop nmnode-0-0 -- bash -c 'ps aux | grep kms'
    kubectl exec -n $BDC_CLUSTER_NS -c hadoop storage-0-0 -- bash -c 'ps aux | grep kms'
    kubectl exec -n $BDC_CLUSTER_NS -c hadoop storage-0-1 -- bash -c 'ps aux | grep kms'
    

    This script should be named updatekeyprovider.py:

    #!/usr/bin/env python3
    
    import json
    import re
    import sys
    import xml.etree.ElementTree as ET
    import os
    
    class CommentedTreeBuilder(ET.TreeBuilder):
        def comment(self, data):
            self.start(ET.Comment, {})
            self.data(data)
            self.end(ET.Comment)
    
    domain_name = os.environ['BDC_DOMAIN']
    
    parser = ET.XMLParser(target=CommentedTreeBuilder())
    
    core_site = 'core-site.xml'
    j = json.load(sys.stdin)
    cs = j['data'][core_site]
    csxml = ET.fromstring(cs, parser=parser)
    props = [prop.find('value').text for prop in csxml.findall(
        "./property/name/..[name='hadoop.security.key.provider.path']")]
    
    # Build a semicolon-delimited KMS provider path that points at the
    # nmnode-0-0 endpoint in the cluster domain.
    kms_provider_path = ''
    
    for x in range(5):
        if len(kms_provider_path) != 0:
            kms_provider_path = kms_provider_path + ';'
        kms_provider_path = kms_provider_path + 'nmnode-0-0.' + domain_name
    
    # Add the hadoop.security.key.provider.path property when it is missing.
    if len(props) == 0:
        prop = ET.SubElement(csxml, 'property')
        name = ET.SubElement(prop, 'name')
        name.text = 'hadoop.security.key.provider.path'
        value = ET.SubElement(prop, 'value')
        value.text = 'kms://https@' + kms_provider_path + ':9600/kms'
        cs = ET.tostring(csxml, encoding='utf-8').decode('utf-8')
    
    j['data'][core_site] = cs
    
    kms_site = 'kms-site.xml.tmpl'
    ks = j['data'][kms_site]
    
    kp_uri_regex = re.compile(r'(<name>hadoop.kms.key.provider.uri</name>\s*<value>\s*)(.*)(\s*</value>)', re.MULTILINE)
    
    # Repoint the KMS key provider URI at the BDC KMS service, unless it
    # already points there.
    def replace_uri(match_obj):
        key_provider_uri = 'bdc://https@hdfsvault-svc.' + domain_name
        if match_obj.group(2) == 'jceks://file@/var/run/secrets/keystores/kms/kms.jceks' or match_obj.group(2) == key_provider_uri:
            return match_obj.group(1) + key_provider_uri + match_obj.group(3)
        return match_obj.group(0)
    
    ks = kp_uri_regex.sub(replace_uri, ks)
    
    j['data'][kms_site] = ks
    print(json.dumps(j, indent=4, sort_keys=True))
    

    Execute run-key-provider-patch.sh with the BDC_DOMAIN and BDC_CLUSTER_NS environment variables set.
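
    A possible invocation follows; the domain and namespace values are placeholders for your environment:

    # Placeholder values: substitute your cluster's AD domain name and
    # Kubernetes namespace before running.
    export BDC_DOMAIN="bdc.contoso.local"
    export BDC_CLUSTER_NS="mssql-cluster"

    chmod +x run-key-provider-patch.sh updatekeyprovider.py
    ./run-key-provider-patch.sh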

Next steps

To learn more about how to use encryption at rest effectively on SQL Server Big Data Clusters, see the following articles:

  • SQL Server Big Data Clusters transparent data encryption (TDE) at rest usage guide
  • SQL Server Big Data Clusters HDFS Encryption Zones usage guide

To learn more about SQL Server Big Data Clusters, see the overview article.