Azure Synapse Analytics security white paper: Data protection

Article
01/26/2023

Note

This article forms part of the Azure Synapse Analytics security white paper series of articles. For an overview of the series, see Azure Synapse Analytics security white paper.

Data discovery and classification

Organizations need to protect their data to comply with federal, local, and company guidelines to mitigate risks of data breach. One challenge organizations face is: How do you protect the data if you don't know where it is? Another is: What level of protection is needed?—because some datasets require more protection than others.

Imagine an organization with hundreds or thousands of files stored in their data lake, and hundreds or thousands of tables in their databases. It would benefit from a process that automatically scans every row and column of the file system or table and classifies columns as potentially sensitive data. This process is known as data discovery.

Once the data discovery process is complete, it provides classification recommendations based on a predefined set of patterns, keywords, and rules. Someone can then review the recommendations and apply sensitivity-classification labels to appropriate columns. This process is known as classification.

Azure Synapse provides two options for data discovery and classification:

Data Discovery & Classification, which is built into Azure Synapse and dedicated SQL pool (formerly SQL DW).
Microsoft Purview, which is a unified data governance solution that helps manage and govern on-premises, multicloud, and software-as-a-service (SaaS) data. It can automate data discovery, lineage identification, and data classification. By producing a unified map of data assets and their relationships, it makes data easily discoverable.

Note

Microsoft Purview data discovery and classification is in public preview for Azure Synapse, dedicated SQL pool (formerly SQL DW), and serverless SQL pool. However, data lineage is currently not supported for Azure Synapse, dedicated SQL pool (formerly SQL DW), and serverless SQL pool. Apache Spark pool only supports lineage tracking.

Data encryption

Data is encrypted at rest and in transit.

Data at rest

By default, Azure Storage automatically encrypts all data using 256-bit Advanced Encryption Standard encryption (AES 256). It's one of the strongest block ciphers available and is FIPS 140-2 compliant. The platform manages the encryption key, and it forms the first layer of data encryption. This encryption applies to both user and system databases, including the master database.

Enabling Transparent Data Encryption (TDE) can add a second layer of data encryption for dedicated SQL pools. It performs real-time I/O encryption and decryption of database files, transaction logs files, and backups at rest without requiring any changes to the application. By default, it uses AES 256.

By default, TDE protects the database encryption key (DEK) with a built-in server certificate (service managed). There's an option to bring your own key (BYOK) that can be securely stored in Azure Key Vault.

Azure Synapse SQL serverless pool and Apache Spark pool are analytic engines that work directly on Azure Data Lake Gen2 (ALDS Gen2) or Azure Blob Storage. These analytic runtimes don't have any permanent storage and rely on Azure Storage encryption technologies for data protection. By default, Azure Storage encrypts all data using server-side encryption (SSE). It's enabled for all storage types (including ADLS Gen2) and cannot be disabled. SSE encrypts and decrypts data transparently using AES 256.

There are two SSE encryption options:

Microsoft-managed keys: Microsoft manages every aspect of the encryption key, including key storage, ownership, and rotations. It's entirely transparent to customers.
Customer-managed keys: In this case, the symmetric key used to encrypt data in Azure Storage is encrypted using a customer-provided key. It supports RSA and RSA-HSM (Hardware Security Modules) keys of sizes 2048, 3072, and 4096. Keys can be securely stored in Azure Key Vault or Azure Key Vault Managed HSM. It provides fine grain access control of the key and its management, including storage, backup, and rotations. For more information, see Customer-managed keys for Azure Storage encryption.

While SSE forms the first layer of encryption, cautious customers can double encrypt by enabling a second layer of 256-bit AES encryption at the Azure Storage infrastructure layer. Known as infrastructure encryption, it uses a platform-managed key together with a separate key from SSE. So, data in the storage account is encrypted twice; once at the service level and once at the infrastructure level with two different encryption algorithms and different keys.

Data in transit

Azure Synapse, dedicated SQL pool (formerly SQL DW), and serverless SQL pool use the Tabular Data Stream (TDS) protocol to communicate between the SQL pool endpoint and a client machine. TDS depends on Transport Layer Security (TLS) for channel encryption, ensuring all data packets are secured and encrypted between endpoint and client machine. It uses a signed server certificate from the Certificate Authority (CA) used for TLS encryption, managed by Microsoft. Azure Synapse supports data encryption in transit with TLS v1.2, using AES 256 encryption.

Azure Synapse leverages TLS to ensure data is encrypted in motion. SQL dedicated pools support TLS 1.0, TLS 1.1, and TLS 1.2 versions for encryption wherein Microsoft-provided drivers use TLS 1.2 by default. Serverless SQL pool and Apache Spark pool use TLS 1.2 for all outbound connections.

Next steps

In the next article in this white paper series, learn about access control.