Introduction to Azure Data Factory

In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage systems. However, on its own, raw data doesn't have the proper context or meaning to provide meaningful insights to analysts, data scientists, or business decision makers.

Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.

For example, imagine a gaming company that collects petabytes of game logs that are produced by games in the cloud. The company wants to analyze these logs to gain insights into customer preferences, demographics, and usage behavior. It also wants to identify up-sell and cross-sell opportunities, develop compelling new features, drive business growth, and provide a better experience to its customers.

To analyze these logs, the company needs to use reference data such as customer information, game information, and marketing campaign information that is in an on-premises data store. The company wants to combine this data from the on-premises data store with additional log data that it has in a cloud data store.

To extract insights, it hopes to process the joined data by using a Spark cluster in the cloud (Azure HDInsight) and publish the transformed data into a cloud data warehouse such as Azure SQL Data Warehouse, so that it can easily build a report on top of that data. The company wants to automate this workflow, and monitor and manage it on a daily schedule. It also wants to execute the workflow when files land in a blob store container.

Azure Data Factory is the platform that solves such data scenarios. It is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. The service can process and transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.

Additionally, you can publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized into meaningful data stores and data lakes for better business decisions.

(Figure: Top-level view of Data Factory)

How does it work?

The pipelines (data-driven workflows) in Azure Data Factory typically perform the following four steps:

(Figure: The four steps of a data-driven workflow)

Connect and collect

Enterprises have data of various types (structured, unstructured, and semi-structured) that is located in disparate sources on-premises and in the cloud, all arriving at different intervals and speeds.

The first step in building an information production system is to connect to all the required sources of data and processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services. The next step is to move the data as needed to a centralized location for subsequent processing.

Without Data Factory, enterprises must build custom data-movement components or write custom services to integrate these data sources and processing. Such systems are expensive and hard to integrate and maintain. In addition, they often lack the enterprise-grade monitoring, alerting, and controls that a fully managed service can offer.

With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis. For example, you can collect data in Azure Data Lake Store and transform the data later by using an Azure Data Lake Analytics compute service. You can also collect data in Azure Blob storage and transform it later by using an Azure HDInsight Hadoop cluster.
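As an illustration, here is a minimal sketch of defining a Copy Activity with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and dataset names are hypothetical placeholders, and exact model parameters can differ between SDK versions.

```python
# Sketch: a Copy Activity that moves data from a blob source to a blob sink.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, BlobSource, BlobSink, PipelineResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

copy_activity = CopyActivity(
    name="CopyRawLogs",
    inputs=[DatasetReference(reference_name="RawLogsDataset")],
    outputs=[DatasetReference(reference_name="StagedLogsDataset")],
    source=BlobSource(),   # read from Azure Blob storage
    sink=BlobSink(),       # write to the centralized store
)

pipeline = PipelineResource(activities=[copy_activity])
client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "IngestPipeline", pipeline
)
```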

Transform and enrich

After data is present in a centralized data store in the cloud, process or transform the collected data by using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning. You want to reliably produce transformed data on a maintainable and controlled schedule to feed production environments with trusted data.
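For example, a transformation step might be expressed as a Hive activity that runs a script on an HDInsight cluster. The sketch below assumes the azure-mgmt-datafactory Python SDK; the linked service names and script path are hypothetical.

```python
# Sketch: a Hive activity that runs a transformation script on an HDInsight cluster.
from azure.mgmt.datafactory.models import (
    HDInsightHiveActivity, LinkedServiceReference, PipelineResource,
)

hive_activity = HDInsightHiveActivity(
    name="PartitionGameLogs",
    # Compute linked service: the HDInsight cluster that runs the query.
    linked_service_name=LinkedServiceReference(reference_name="HDInsightLinkedService"),
    # Hive script stored in blob storage, reachable through a storage linked service.
    script_path="scripts/partition_logs.hql",
    script_linked_service=LinkedServiceReference(reference_name="AzureStorageLinkedService"),
)

transform_pipeline = PipelineResource(activities=[hive_activity])
```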

Publish

After the raw data has been refined into a business-ready consumable form, load the data into Azure SQL Data Warehouse, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users can point to from their business intelligence tools.

Monitor

After you have successfully built and deployed your data integration pipeline, providing business value from refined data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.
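Programmatic monitoring can be as simple as polling a pipeline run and listing its activity runs. A minimal sketch with the azure-mgmt-datafactory Python SDK follows; the run ID and resource names are placeholders.

```python
# Sketch: check the status of a pipeline run and inspect its activity runs.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = client.pipeline_runs.get("<resource-group>", "<factory-name>", "<run-id>")
print(run.status)  # e.g. InProgress, Succeeded, Failed

# List the activity runs that belong to this pipeline run within a time window.
filter_params = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow() + timedelta(days=1),
)
activity_runs = client.activity_runs.query_by_pipeline_run(
    "<resource-group>", "<factory-name>", "<run-id>", filter_params
)
for activity_run in activity_runs.value:
    print(activity_run.activity_name, activity_run.status)
```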

Top-level concepts

An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data Factory is composed of four key components. These components work together to provide the platform on which you can compose data-driven workflows with steps to move and transform data.
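Creating a factory instance in a subscription is a single management call. A minimal sketch with the azure-mgmt-datafactory Python SDK (resource group, name, and region are placeholders):

```python
# Sketch: create a data factory instance in a subscription.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
factory = client.factories.create_or_update(
    "<resource-group>", "<factory-name>", Factory(location="eastus")
)
print(factory.provisioning_state)
```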

Pipeline

A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to partition the data.

The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each one individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.

Activity

Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data from one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.

Datasets

Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.
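For instance, a blob dataset only describes where the data lives. A sketch with the azure-mgmt-datafactory Python SDK, where the container path, linked service, and dataset names are hypothetical:

```python
# Sketch: a dataset that points to a folder of log files in Azure Blob storage.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, DatasetResource, LinkedServiceReference,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

blob_dataset = AzureBlobDataset(
    linked_service_name=LinkedServiceReference(reference_name="AzureStorageLinkedService"),
    folder_path="gamelogs/raw",  # container/folder that holds the data
)

client.datasets.create_or_update(
    "<resource-group>", "<factory-name>", "RawLogsDataset",
    DatasetResource(properties=blob_dataset),
)
```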

Linked services

Linked services are much like connection strings, which define the connection information that's needed for Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to the data source, and a dataset represents the structure of the data. For example, an Azure Storage linked service specifies a connection string to connect to the Azure Storage account. Additionally, an Azure blob dataset specifies the blob container and the folder that contains the data.

Linked services are used for two purposes in Data Factory (a short sketch illustrating both purposes follows the list):

  • To represent a data store that includes, but isn't limited to, an on-premises SQL Server database, Oracle database, file share, or Azure blob storage account. For a list of supported data stores, see the copy activity article.

  • To represent a compute resource that can host the execution of an activity. For example, the HDInsightHive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and supported compute environments, see the transform data article.
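A minimal sketch of both kinds of linked service with the azure-mgmt-datafactory Python SDK; the connection string, cluster URI, credentials, and names are placeholders, and exact model parameters can vary by SDK version.

```python
# Sketch: a data-store linked service (Azure Storage) and a compute linked service (HDInsight).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService, HDInsightLinkedService, LinkedServiceResource,
    LinkedServiceReference, SecureString,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Data store: how to reach the storage account that holds the raw logs.
storage_ls = AzureStorageLinkedService(
    connection_string=SecureString(
        value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    ),
)
client.linked_services.create_or_update(
    "<resource-group>", "<factory-name>", "AzureStorageLinkedService",
    LinkedServiceResource(properties=storage_ls),
)

# Compute: an existing HDInsight cluster that can host Hive activity runs.
hdinsight_ls = HDInsightLinkedService(
    cluster_uri="https://<cluster>.azurehdinsight.net",
    user_name="<user>",
    password=SecureString(value="<password>"),
    linked_service_name=LinkedServiceReference(reference_name="AzureStorageLinkedService"),
)
client.linked_services.create_or_update(
    "<resource-group>", "<factory-name>", "HDInsightLinkedService",
    LinkedServiceResource(properties=hdinsight_ls),
)
```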

Triggers

Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off. There are different types of triggers for different types of events.
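For example, the daily schedule from the gaming scenario above could be expressed as a schedule trigger. A sketch with the azure-mgmt-datafactory Python SDK; names, times, and parameters are placeholders.

```python
# Sketch: a trigger that runs a pipeline once a day.
from datetime import datetime
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ScheduleTrigger, ScheduleTriggerRecurrence, TriggerPipelineReference,
    PipelineReference, TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

daily = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day",
        interval=1,
        start_time=datetime(2024, 1, 1),
        time_zone="UTC",
    ),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="IngestPipeline"),
        parameters={},
    )],
)

client.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "DailyTrigger",
    TriggerResource(properties=daily),
)
```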

Pipeline runs

A pipeline run is an instance of a pipeline execution. Pipeline runs are typically instantiated by passing arguments to the parameters that are defined in the pipeline. The arguments can be passed manually or within the trigger definition.
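Kicking off a run manually and passing arguments might look like the following sketch (azure-mgmt-datafactory Python SDK; the pipeline and parameter names are hypothetical):

```python
# Sketch: start a pipeline run manually, passing an argument for a pipeline parameter.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = client.pipelines.create_run(
    "<resource-group>", "<factory-name>", "IngestPipeline",
    parameters={"inputFolder": "gamelogs/2024-01-01"},  # argument for a defined parameter
)
print(run.run_id)  # use this ID to monitor the run
```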

Parameters

Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The arguments for the defined parameters are passed during execution from the run context that was created by a trigger or by a pipeline that was executed manually. Activities within the pipeline consume the parameter values.

A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets and can consume the properties that are defined in the dataset definition.

A linked service is also a strongly typed parameter that contains the connection information to either a data store or a compute environment. It is also a reusable/referenceable entity.
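To tie these together, here is a sketch (azure-mgmt-datafactory Python SDK; the parameter and dataset names are hypothetical) of defining a pipeline parameter and consuming it from an activity through a Data Factory expression:

```python
# Sketch: define a pipeline parameter and consume it in an activity via an expression.
from azure.mgmt.datafactory.models import (
    ParameterSpecification, PipelineResource, CopyActivity, DatasetReference,
    BlobSource, BlobSink,
)

copy_logs = CopyActivity(
    name="CopyDailyLogs",
    # The pipeline parameter is forwarded to a hypothetical dataset parameter named "folder".
    inputs=[DatasetReference(
        reference_name="RawLogsDataset",
        parameters={"folder": "@pipeline().parameters.inputFolder"},
    )],
    outputs=[DatasetReference(reference_name="StagedLogsDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

pipeline = PipelineResource(
    parameters={"inputFolder": ParameterSpecification(type="String")},
    activities=[copy_logs],
)
```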

Control flow

Control flow is the orchestration of pipeline activities. It includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on demand or from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
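As an illustration, here is a sketch of sequential chaining and a For-each looping container with the azure-mgmt-datafactory Python SDK; activity, dataset, and parameter names are hypothetical.

```python
# Sketch: chain one activity after another and loop over a list with ForEach.
from azure.mgmt.datafactory.models import (
    CopyActivity, ForEachActivity, ActivityDependency, Expression,
    DatasetReference, BlobSource, BlobSink, PipelineResource, ParameterSpecification,
)

stage = CopyActivity(
    name="StageLogs",
    inputs=[DatasetReference(reference_name="RawLogsDataset")],
    outputs=[DatasetReference(reference_name="StagedLogsDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

per_folder_copy = CopyActivity(
    name="CopyFolder",
    inputs=[DatasetReference(reference_name="StagedLogsDataset")],
    outputs=[DatasetReference(reference_name="CuratedLogsDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Loop over a list of folders passed in as a pipeline parameter;
# the loop runs only after the staging activity succeeds.
loop = ForEachActivity(
    name="ForEachFolder",
    items=Expression(value="@pipeline().parameters.folders"),
    activities=[per_folder_copy],
    depends_on=[ActivityDependency(activity="StageLogs", dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(
    parameters={"folders": ParameterSpecification(type="Array")},
    activities=[stage, loop],
)
```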

For more information about Data Factory concepts, see the related concept articles in the Data Factory documentation.

Supported regions

For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on the following page, and then expand Analytics to locate Data Factory: Products available by region. However, a data factory can access data stores and compute services in other Azure regions to move data between data stores or to process data by using compute services.

Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the movement of data between supported data stores, and the processing of data by using compute services in other regions or in an on-premises environment. It also lets you monitor and manage workflows by using both programmatic and UI mechanisms.

Although Data Factory is available only in certain regions, the service that powers the data movement in Data Factory is available globally in several regions. If a data store is behind a firewall, a self-hosted integration runtime that's installed in your on-premises environment moves the data instead.

For example, let's assume that your compute environments, such as an Azure HDInsight cluster and Azure Machine Learning, are running in the West Europe region. You can create and use an Azure Data Factory instance in East US or East US 2 and use it to schedule jobs on your compute environments in West Europe. It takes a few milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the job on your compute environment does not change.

Accessibility

The Data Factory user experience in the Azure portal is accessible.

Compare with version 1

For a list of differences between version 1 and the current version of the Data Factory service, see Compare with version 1.

Next steps

Get started with creating a Data Factory pipeline by using one of the available tools or SDKs.