Data catalog

Article
10/31/2022

The data catalog registers and maintains the data information in a centralized place and makes it available for the organization. It ensures that enterprises avoid duplicate data products caused by redundant data ingestion by different project teams.

We recommend you provision a data catalog service to define the metadata of the data products stored across the data landing zones.

Cloud-scale analytics relies on Microsoft Purview to serve as:

A system of registration
A discovery for enterprise data sources
A data classification engine
A policy store
An API for registering and reading data information
A compliance dashboard hub

Because the data catalog is part of the data management landing zone, it can communicate with each data landing zone via its virtual network (VNet) peering and self-hosted integration runtimes. Discovery of data products in on-premises stores and other public clouds is achieved by more deployments of self-hosted integration runtimes.

Note

Although this documentation focuses primarily on using Microsoft Purview for data catalog capabilities and data classification, enterprises might have invested in other products, such as Alation, Okera, or Collibra. If this is the case, work with your vendor to apply the principles described for a data management landing zone as nearby as possible. Be aware that some custom integration might be required.

Data discovery

Data discovery reflects the state of all the data that the enterprise owns. This data is known as the data estate. During data discovery, the data estate is scanned and classified. The data scanning process connects directly to the data source according to a set schedule.

As you add a new data landing zone to the environment, the associated data lakes and polyglot persistence sources are registered as sources for the data catalog crawlers to scan.

With automated discovery of your data estate to populate the catalog, you can:

Crawl metadata from Azure and on-premises data sources
Scan your data lakes, blobs, and other supported targets
Extract schema from your data targets for XML, TSV, CSV, PSV, SSV, JSON, Parquet, Avro, and ORC file types
Allow automated catalog updates through configurable scheduling of scans and scan rule sets

Important

When you add a new data landing zone to the environment, register the associated data lakes and polyglot storage through Azure DevOps as a source for the data catalog crawlers to scan.

Data classification

Microsoft Purview allows you to apply system or custom data classifications on file, table, or column assets.

Data classifications are like subject tags. Microsoft Purview marks and identifies the content of specific data types found within your data estate during scanning. You use sensitivity labels to identify the categories of classification types within your organizational data. You can also use sensitivity labels to group the policies you wish to apply to each category. Microsoft Purview makes use of the same sensitive information types as Microsoft 365, allowing you to stretch your existing security policies and protections across your entire content and data estate.

Microsoft Purview can scan and automatically classify documents. For example, if you have a file named multiple.docx and it has a national ID number in its content, Microsoft Purview adds a classification such as EU National Identification Number in the asset detail page.

Microsoft Defender for SQL is a feature available for Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics. It includes functionality for discovering and classifying sensitive data, surfacing and mitigating potential database vulnerabilities, and detecting anomalous activities that could indicate a threat to your database. Microsoft Defender for SQL provides a single go-to location for enabling and managing these capabilities.

Next steps

Data lineage

Data catalog

Data discovery

Data classification

Next steps

Feedback

Feedback

Additional resources