Data lineage in Azure Purview Data Catalog client

This article provides an overview of data lineage in Azure Purview Data Catalog. It also details how data systems can integrate with the catalog to capture the lineage of data. Purview can capture lineage for data in different parts of your organization's data estate and at different levels of preparation, including:

  • Completely raw data staged from various platforms
  • Transformed and prepared data
  • Data used by visualization platforms

Use cases

Data lineage is broadly understood as the lifecycle that spans the data's origin and where it moves over time across the data estate. It is used for different kinds of backward-looking scenarios such as troubleshooting, tracing root causes in data pipelines, and debugging. Lineage is also used for data quality analysis, compliance, and "what if" scenarios, often referred to as impact analysis. Lineage is represented visually to show data moving from source to destination, including how the data was transformed. Given the complexity of most enterprise data environments, these views can be hard to understand without some consolidation or masking of peripheral data points.

Lineage experience in Azure Purview Data Catalog

Purview Data Catalog connects with other data processing, storage, and analytics systems to extract lineage information. The information is combined to represent a generic, scenario-specific lineage experience in the catalog.

Figure: End-to-end lineage showing data copied from a blob store all the way to a Power BI dashboard.

Your data estate may include systems that perform data extraction, transformation (ETL/ELT systems), analytics, and visualization. Each of these systems captures rich static and operational metadata that describes the state and quality of the data within the system's boundary. The goal of lineage in a data catalog is to extract the movement, transformation, and operational metadata from each data system at the lowest grain possible.

The following example is a typical use case of data moving across multiple systems, where the data catalog would connect to each of the systems for lineage.

  • Data Factory copies data from an on-premises/raw zone to a landing zone in the cloud.
  • Data processing systems such as Synapse and Databricks process and transform data from the landing zone to the curated zone using notebooks.
  • The data is further processed into analytical models for optimal query performance and aggregation.
  • Data visualization systems consume the datasets and process them through their meta model to create BI dashboards, ML experiments, and so on.
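The four hops above can be sketched as a list of (source, process, target) edges that is walked in either direction. This is a minimal illustration in plain Python, not the Purview API; every asset and process name below is a hypothetical placeholder. Walking downstream answers an impact-analysis question, and walking upstream answers a root-cause question.

```python
from collections import defaultdict, deque

# Hypothetical asset and process names; none of these come from a real catalog.
EDGES = [
    ("onprem/raw/sales.csv", "adf_copy_sales", "landing/sales.csv"),
    ("landing/sales.csv", "databricks_clean_sales", "curated/sales"),
    ("curated/sales", "build_sales_model", "analytics/sales_model"),
    ("analytics/sales_model", "powerbi_refresh", "powerbi/sales_dashboard"),
]


def downstream(edges, start):
    """Return every asset reachable from `start`, in breadth-first order."""
    targets = defaultdict(list)
    for source, _process, target in edges:
        targets[source].append(target)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in targets[node]:
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def upstream(edges, start):
    """Return every asset that `start` depends on, nearest first."""
    reversed_edges = [(target, process, source) for source, process, target in edges]
    return downstream(reversed_edges, start)


# Impact analysis: what is affected if the landing file changes?
impacted = downstream(EDGES, "landing/sales.csv")
# Root cause: where did the dashboard's data come from?
origins = upstream(EDGES, "powerbi/sales_dashboard")
print(impacted)
print(origins)
```

Storing the process name on each edge keeps the "who moved the data" information next to the "where did it go" information, which is what a lineage view needs to draw the connecting step between two assets.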

Lineage granularity

This section covers the granularity at which a data catalog gathers lineage information. This granularity can vary from one data system to another.

Entity level lineage: Source(s) > Process > Target(s)

  • Lineage is represented as a graph. Typically, it contains source and target entities in data storage systems that are connected by a process invoked by a compute system.
  • Data systems connect to the data catalog to generate and report a unique object referencing the physical object of the underlying data system, for example a SQL stored procedure, a notebook, and so on.
  • High-fidelity lineage with additional metadata such as ownership is captured to show the lineage in a human-readable format for source and target entities. For example: lineage at a Hive table level instead of at a partition or file level.
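As an illustration of the Source(s) > Process > Target(s) shape, the sketch below models entities and a process as plain Python dataclasses and rolls partition paths up to table level before reporting. It is not the Purview data model: the class names, the `hive://` naming scheme, the owners, and the assumed `<table>/<key>=<value>/...` partition layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    qualified_name: str  # unique reference to the physical object
    entity_type: str     # for example "hive_table"
    owner: str = "unknown"


@dataclass(frozen=True)
class Process:
    qualified_name: str
    process_type: str    # for example "sql_stored_procedure" or "notebook"
    inputs: tuple
    outputs: tuple


def table_of(path: str) -> str:
    """Roll a partition or file path up to its table.

    Assumes the layout <table>/<key>=<value>/...; a path with no
    key=value segment is returned unchanged.
    """
    kept = []
    for part in path.split("/"):
        if "=" in part:
            break
        kept.append(part)
    return "/".join(kept)


def describe(process: Process) -> str:
    """Render one lineage step in a human-readable form."""
    ins = ", ".join(f"{e.qualified_name} (owner: {e.owner})" for e in process.inputs)
    outs = ", ".join(f"{e.qualified_name} (owner: {e.owner})" for e in process.outputs)
    return f"{ins} > {process.qualified_name} > {outs}"


# Two partition files collapse into a single table-level source entity.
raw_files = [
    "hive://warehouse/sales/dt=2021-01-01/part-0000",
    "hive://warehouse/sales/dt=2021-01-02/part-0000",
]
source_names = sorted({table_of(p) for p in raw_files})
sources = tuple(Entity(name, "hive_table", owner="sales-team") for name in source_names)
target = Entity("hive://warehouse/sales_daily", "hive_table", owner="analytics-team")
process = Process("notebook://etl/aggregate_sales", "notebook", sources, (target,))

print(describe(process))
```

Reporting at table level keeps the graph small enough to read, while the qualified name still points back to the physical object in the underlying system.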

Column or attribute level lineage

Column or attribute level lineage identifies the attribute(s) of a source entity that are used to create or derive attribute(s) in the target entity. The name of the source attribute can be retained or renamed in the target. Systems like ADF can do a one-to-one copy from an on-premises environment to the cloud. For example: Table1/ColumnA -> Table2/ColumnA.
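A column-level mapping can be sketched as a dictionary from each target column to the source columns it was copied or derived from. Only the Table1/ColumnA -> Table2/ColumnA entry comes from the text; the renamed, derived, and multi-hop entries are hypothetical additions for illustration, and the mapping is assumed to contain no cycles.

```python
# Target column -> source columns it was copied or derived from.
COLUMN_LINEAGE = {
    "Table2/ColumnA": ["Table1/ColumnA"],        # one-to-one copy, name retained (from the text)
    "Table2/CustomerName": ["Table1/cust_nm"],   # hypothetical: copied and renamed
    "Table3/FullName": ["Table2/FirstName", "Table2/LastName"],  # hypothetical: derived from two columns
    "Table3/ColumnA": ["Table2/ColumnA"],        # hypothetical: second hop of the same column
}


def origin_columns(mapping, column):
    """Follow the mapping back to the columns that have no recorded source.

    Assumes the mapping is acyclic; a cycle would recurse without end.
    """
    sources = mapping.get(column)
    if not sources:
        return {column}
    found = set()
    for source in sources:
        found |= origin_columns(mapping, source)
    return found


print(origin_columns(COLUMN_LINEAGE, "Table3/ColumnA"))
print(sorted(origin_columns(COLUMN_LINEAGE, "Table3/FullName")))
```

Keeping one entry per target column makes renames and derivations look the same to the traversal: a rename is a list with one differently named source, and a derivation is a list with several.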

Process execution status

To support root cause analysis and data quality scenarios, we capture the execution status of the jobs in data processing systems. This requirement is not about replacing the monitoring capabilities of other data processing systems, nor is that the goal.
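To show how execution status can support root cause analysis, the sketch below keeps the latest run status for each process and walks upstream from an asset, collecting the processes whose most recent run failed. The record shape, the "Succeeded"/"Failed" vocabulary, and all names are assumptions made for this illustration; they do not describe how Purview stores run information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    process: str
    finished_at: str  # ISO 8601 timestamp in one format, so string order equals time order
    status: str       # "Succeeded" or "Failed" (assumed vocabulary)


# Hypothetical lineage: target asset -> (process that produced it, source asset).
PRODUCED_BY = {
    "landing/sales.csv": ("adf_copy_sales", "onprem/raw/sales.csv"),
    "curated/sales": ("databricks_clean_sales", "landing/sales.csv"),
}

# Hypothetical run history.
RUNS = [
    Run("adf_copy_sales", "2021-03-01T02:00:00Z", "Succeeded"),
    Run("adf_copy_sales", "2021-03-02T02:00:00Z", "Failed"),
    Run("databricks_clean_sales", "2021-03-02T03:00:00Z", "Succeeded"),
]


def latest_status(runs):
    """Map each process to the status of its most recent run."""
    latest = {}
    for run in sorted(runs, key=lambda r: r.finished_at):
        latest[run.process] = run.status
    return latest


def failed_upstream(asset, produced_by, runs):
    """Walk upstream from `asset` and list processes whose latest run failed."""
    status = latest_status(runs)
    failed = []
    while asset in produced_by:
        process, asset = produced_by[asset]
        if status.get(process) == "Failed":
            failed.append(process)
    return failed


print(failed_upstream("curated/sales", PRODUCED_BY, RUNS))
```

In this example the cleaning job itself succeeded, but the copy job feeding it failed on its latest run, which is the kind of answer lineage combined with status is meant to surface. Detailed logs and alerting stay with each processing system's own monitoring.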

Summary

Lineage is a critical feature of the Purview Data Catalog to support quality, trust, and audit scenarios. The goal of a data catalog is to build a robust framework where all the data systems within your environment can naturally connect and report lineage. Once the metadata is available, the data catalog can bring together the metadata provided by data systems to power data governance use cases.

Next steps