Supported data sources and file types

This article discusses currently supported data sources, file types, and scanning concepts in the Microsoft Purview Data Map.

Microsoft Purview Data Map available data sources

The table below shows the supported capabilities for each data source. Select the data source, or the feature, to learn more.

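| Category | Data Store | Technical metadata | Classification | Lineage | Access Policy |
|---|---|---|---|---|---|
| Azure | Azure Blob Storage | Yes | Yes | Limited* | Yes (Preview) |
| Azure | Azure Cosmos DB | Yes | Yes | No* | No |
| Azure | Azure Data Explorer | Yes | Yes | No* | No |
| Azure | Azure Data Factory | Yes | No | Yes | No |
| Azure | Azure Data Lake Storage Gen1 | Yes | Yes | Limited* | No |
| Azure | Azure Data Lake Storage Gen2 | Yes | Yes | Limited* | Yes (Preview) |
| Azure | Azure Data Share | Yes | No | Yes | No |
| Azure | Azure Database for MySQL | Yes | Yes | No* | No |
| Azure | Azure Database for PostgreSQL | Yes | Yes | No* | No |
| Azure | Azure Dedicated SQL pool (formerly SQL DW) | Yes | Yes | No* | No |
| Azure | Azure Files | Yes | Yes | Limited* | No |
| Azure | Azure SQL Database | Yes | Yes | Yes (Preview) | Yes (Preview) |
| Azure | Azure SQL Managed Instance | Yes | Yes | No* | No |
| Azure | Azure Synapse Analytics (Workspace) | Yes | Yes | Yes - Synapse pipelines | No |
| Database | Amazon RDS | Yes | Yes | No | No |
| Database | Cassandra | Yes | No | Yes | No |
| Database | Db2 | Yes | No | Yes | No |
| Database | Google BigQuery | Yes | No | Yes | No |
| Database | Hive Metastore Database | Yes | No | Yes* | No |
| Database | MongoDB | Yes | No | No | No |
| Database | MySQL | Yes | No | Yes | No |
| Database | Oracle | Yes | No | Yes* | No |
| Database | PostgreSQL | Yes | No | Yes | No |
| Database | SAP Business Warehouse | Yes | No | No | No |
| Database | SAP HANA | Yes | No | No | No |
| Database | Snowflake | Yes | No | Yes | No |
| Database | SQL Server | Yes | Yes | No* | No |
| Database | SQL Server on Azure-Arc | No | No | No | Yes (Preview) |
| Database | Teradata | Yes | No | Yes* | No |
| File | Amazon S3 | Yes | Yes | Limited* | No |
| Services and apps | Erwin | Yes | No | Yes | No |
| Services and apps | Looker | Yes | No | Yes | No |
| Services and apps | Power BI | Yes | No | Yes | No |
| Services and apps | Salesforce | Yes | No | No | No |
| Services and apps | SAP ECC | Yes | No | Yes* | No |
| Services and apps | SAP S/4HANA | Yes | No | Yes* | No |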

* Besides the lineage on assets within the data source, lineage is also supported if the dataset is used as a source or sink in a Data Factory or Synapse pipeline.

Note

Currently, the Microsoft Purview Data Map can't scan an asset that has /, \, or # in its name. To scope your scan and avoid scanning assets that have those characters in the asset name, use the example in Register and scan an Azure SQL Database.
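As a rough illustration of the restriction in this note, the following sketch checks asset names for the three unsupported characters before you decide how to scope a scan. The function name `is_scannable_name` and the sample asset names are hypothetical and aren't part of any Microsoft Purview API.

```python
# Characters that, per the note above, prevent an asset from being scanned.
UNSUPPORTED_CHARS = frozenset("/\\#")


def is_scannable_name(asset_name: str) -> bool:
    """Return True if the asset name contains none of /, \\, or #."""
    return not any(ch in UNSUPPORTED_CHARS for ch in asset_name)


# Hypothetical asset names, for illustration only.
names = ["SalesOrders", "archive/2021", "temp#1", r"legacy\customers"]
scannable = [name for name in names if is_scannable_name(name)]
print(scannable)
```

A check like this only identifies the affected names. Excluding them from a scan is still done through the scan scoping described in the linked article.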

Scan regions

The following is a list of all the Azure data source (data center) regions where the Microsoft Purview Data Map scanner runs. If your Azure data source is in a region outside of this list, the scanner will run in the region of your Microsoft Purview instance.

Microsoft Purview Data Map scanner regions

  • Australia East
  • Australia Southeast
  • Brazil South
  • Canada Central
  • Central India
  • Central US
  • East Asia
  • East US
  • East US 2
  • France Central
  • Japan East
  • Korea Central
  • North Central US
  • North Europe
  • South Africa North
  • South Central US
  • Southeast Asia
  • UAE North
  • UK South
  • West Central US
  • West Europe
  • West US
  • West US 2
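
The fallback rule described above can be sketched as a simple lookup. This is a minimal illustration that assumes the scanner runs in the data source's own region whenever that region appears in the list. The function name `scanner_region` is hypothetical, and the region names are copied from the list above.

```python
# Regions copied from the scanner region list above.
SCANNER_REGIONS = frozenset({
    "Australia East", "Australia Southeast", "Brazil South", "Canada Central",
    "Central India", "Central US", "East Asia", "East US", "East US 2",
    "France Central", "Japan East", "Korea Central", "North Central US",
    "North Europe", "South Africa North", "South Central US", "Southeast Asia",
    "UAE North", "UK South", "West Central US", "West Europe", "West US",
    "West US 2",
})


def scanner_region(data_source_region: str, purview_region: str) -> str:
    """Pick the region where the scan is expected to run.

    Assumption: a listed data source region is used as is; any other
    region falls back to the region of the Microsoft Purview instance.
    """
    if data_source_region in SCANNER_REGIONS:
        return data_source_region
    return purview_region


print(scanner_region("West Europe", "East US"))
print(scanner_region("Norway East", "North Europe"))
```

In the second call, Norway East isn't in the list, so the sketch returns the region of the Microsoft Purview instance.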

File types supported for scanning

The following file types are supported for scanning, for schema extraction, and classification where applicable:

  • Structured file formats supported by extension: AVRO, ORC, PARQUET, CSV, JSON, PSV, SSV, TSV, TXT, XML, GZIP
  • Document file formats supported by extension: DOC, DOCM, DOCX, DOT, ODP, ODS, ODT, PDF, POT, PPS, PPSX, PPT, PPTM, PPTX, XLC, XLS, XLSB, XLSM, XLSX, XLT
  • The Microsoft Purview Data Map also supports custom file extensions and custom parsers.

Note

  • The Microsoft Purview Data Map scanner only supports schema extraction for the structured file types listed above.
  • For AVRO, ORC, and PARQUET file types, the scanner doesn't support schema extraction for files that contain complex data types (for example, MAP, LIST, STRUCT).
  • The scanner supports scanning Snappy-compressed PARQUET files for schema extraction and classification.
  • For GZIP file types, each GZIP file must contain a single CSV file. GZIP files are subject to system and custom classification rules. Scanning a GZIP file that contains multiple files, or any file type other than CSV, isn't currently supported.
  • For delimited file types (CSV, PSV, SSV, TSV, TXT), data type detection isn't supported. The data type is listed as "string" for all columns.
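
To make the two extension lists easier to work with, the following sketch sorts a file path into the structured or document category by its extension. It's only an illustration: the function name `file_format_category` is hypothetical, the extension sets are copied from the lists in this section, and custom file extensions aren't modeled, so they fall into "other".

```python
from pathlib import PurePosixPath

# Extensions copied from the lists in this section.
STRUCTURED_EXTENSIONS = frozenset({
    "AVRO", "ORC", "PARQUET", "CSV", "JSON", "PSV", "SSV", "TSV", "TXT",
    "XML", "GZIP",
})
DOCUMENT_EXTENSIONS = frozenset({
    "DOC", "DOCM", "DOCX", "DOT", "ODP", "ODS", "ODT", "PDF", "POT", "PPS",
    "PPSX", "PPT", "PPTM", "PPTX", "XLC", "XLS", "XLSB", "XLSM", "XLSX", "XLT",
})


def file_format_category(path: str) -> str:
    """Classify a path as 'structured', 'document', or 'other' by extension."""
    extension = PurePosixPath(path).suffix.lstrip(".").upper()
    if extension in STRUCTURED_EXTENSIONS:
        return "structured"
    if extension in DOCUMENT_EXTENSIONS:
        return "document"
    return "other"  # includes custom extensions, which this sketch ignores


for sample in ["raw/sales.parquet", "reports/Q1.PDF", "notes.md"]:
    print(sample, "->", file_format_category(sample))
```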

Nested data

Currently, nested data is only supported for JSON content.

For all system-supported file types, if a column contains nested JSON content, the scanner parses the nested JSON data and surfaces it within the schema tab of the asset.

Nested data, or nested schema parsing, isn't supported in SQL. A column with nested data will be reported and classified as is, and subdata won't be parsed.
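
To give a sense of what parsing nested JSON into a schema can look like, the following sketch walks a JSON value and lists the dotted path of each leaf field. It illustrates the general idea only and doesn't describe how the Microsoft Purview scanner is implemented. The function name `nested_paths` and the sample value are hypothetical, and arrays are treated as leaf values to keep the example short.

```python
import json


def nested_paths(value, prefix=""):
    """Yield a dotted path for every leaf field in a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            yield from nested_paths(child, child_prefix)
    else:
        # Scalars (and, in this sketch, arrays) are reported as leaves.
        yield prefix


# Hypothetical content of a single column value.
cell = '{"customer": {"name": "Ada", "address": {"city": "Oslo"}}, "total": 12.5}'
paths = list(nested_paths(json.loads(cell)))
print(paths)
```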

Sampling within a file

The Microsoft Purview Data Map uses the following scan level terminology:

  • L1 scan: Extracts basic information and metadata, like file name, size, and fully qualified name.
  • L2 scan: Extracts schema for structured file types and database tables.
  • L3 scan: Extracts schema where applicable and subjects the sampled file to system and custom classification rules.

The Microsoft Purview Data Map scanner samples data in the following way:

  • For structured file types, it samples the top 128 rows in each column or the first 1 MB, whichever is lower.
  • For document file formats, it samples the first 20 MB of each file.
    • If a document file is larger than 20 MB, it isn't subject to a deep scan (that is, it isn't classified). In that case, Microsoft Purview captures only basic metadata, like file name and fully qualified name.
  • For tabular data sources (SQL), it samples the top 128 rows.
  • For Azure Cosmos DB (SQL API), up to 300 distinct properties from the first 10 documents in a container are collected for the schema. For each property, values from up to 128 documents or the first 1 MB are sampled.
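
The "128 rows or 1 MB, whichever is lower" rule and the 20 MB document threshold can be sketched as follows. This is a simplified model and not the scanner's actual code. It assumes that 1 MB means 1,048,576 bytes, that each row is a text line measured in UTF-8 bytes, and that a document of exactly 20 MB is still classified. The function names are hypothetical.

```python
MAX_SAMPLE_ROWS = 128
MAX_SAMPLE_BYTES = 1024 * 1024          # assumed meaning of "1 MB"
MAX_DOCUMENT_BYTES = 20 * 1024 * 1024   # assumed meaning of "20 MB"


def sample_rows(rows, max_rows=MAX_SAMPLE_ROWS, max_bytes=MAX_SAMPLE_BYTES):
    """Return the leading rows until the row limit or byte limit is reached."""
    sampled = []
    used_bytes = 0
    for row in rows:
        row_bytes = len(row.encode("utf-8"))
        if len(sampled) >= max_rows or used_bytes + row_bytes > max_bytes:
            break
        sampled.append(row)
        used_bytes += row_bytes
    return sampled


def document_is_classified(size_bytes: int) -> bool:
    """Documents larger than 20 MB get basic metadata only (no deep scan)."""
    return size_bytes <= MAX_DOCUMENT_BYTES


small_rows = [f"{i},value" for i in range(1000)]
wide_rows = ["x" * 600_000] * 5
print(len(sample_rows(small_rows)))  # limited by the 128-row cap
print(len(sample_rows(wide_rows)))   # limited by the 1 MB cap
```

With short rows, the row cap is reached first. With very wide rows, the byte cap stops the sample after a single row.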

Resource set file sampling

A folder or group of partition files is detected as a resource set in the Microsoft Purview Data Map if it matches a system resource set policy or a customer-defined resource set policy. If a resource set is detected, the scanner samples each folder that it contains. To learn more, see the article on resource sets.

File sampling for resource sets by file types:

  • Delimited files (CSV, PSV, SSV, TSV) - 1 in 100 files is sampled (L3 scan) within a folder or group of partition files that is considered a resource set.
  • Data Lake file types (Parquet, Avro, Orc) - 1 in 18446744073709551615 (the maximum unsigned 64-bit integer value) files is sampled (L3 scan) within a folder or group of partition files that is considered a resource set.
  • Other structured file types (JSON, XML, TXT) - 1 in 100 files is sampled (L3 scan) within a folder or group of partition files that is considered a resource set.
  • SQL objects and Azure Cosmos DB entities - Each object is L3 scanned.
  • Document file types - Each file is L3 scanned. Resource set patterns don't apply to these file types.
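
The sampling ratios above can be turned into a rough estimate of how many files in a resource set receive an L3 scan. The following sketch is an approximation under stated assumptions: partial groups round up, so at least one file is sampled in a non-empty resource set, and any file type missing from the table is treated like a document file type, where every file is scanned. The names `SAMPLE_ONE_IN` and `estimated_l3_samples` are hypothetical.

```python
# "1 in N" ratios copied from the list above.
UINT64_MAX = 2**64 - 1  # 18446744073709551615

SAMPLE_ONE_IN = {
    "CSV": 100, "PSV": 100, "SSV": 100, "TSV": 100,
    "PARQUET": UINT64_MAX, "AVRO": UINT64_MAX, "ORC": UINT64_MAX,
    "JSON": 100, "XML": 100, "TXT": 100,
}


def estimated_l3_samples(file_type: str, file_count: int) -> int:
    """Estimate how many files in a resource set get an L3 scan.

    Assumptions: the result rounds up, and unlisted file types
    (for example, document file types) have every file scanned.
    """
    if file_count <= 0:
        return 0
    ratio = SAMPLE_ONE_IN.get(file_type.upper(), 1)
    return (file_count + ratio - 1) // ratio  # integer ceiling division


print(estimated_l3_samples("csv", 250))
print(estimated_l3_samples("parquet", 10_000))
print(estimated_l3_samples("pdf", 40))
```

Under these assumptions, the ratio for Parquet, Avro, and ORC is so large that the estimate comes out to a single file per resource set, whatever the number of partition files.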

Classification

All 208 system classification rules apply to structured file formats. Only the MCE classification rules apply to document file types. The data scan native regex patterns and bloom filter-based detection don't apply to them. For more information on supported classifications, see Supported classifications in the Microsoft Purview Data Map.
