Introduction to Azure Data Factory Service, a data integration service in the cloud

What is Azure Data Factory?

Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. Using the Data Factory service, you can create data integration solutions that ingest data from various data stores, transform/process the data, and publish the results back to data stores.

The Data Factory service allows you to create data pipelines that move and transform data, and then run the pipelines on a specified schedule (hourly, daily, weekly, etc.). It also provides rich visualizations to display the lineage and dependencies between your data pipelines, and lets you monitor all the pipelines from a single unified view to easily pinpoint issues and set up monitoring alerts.

Diagram: Data Factory Overview, a data integration service

Figure 1. Ingest data from various data sources, prepare, transform, and analyze the data, and then publish ready-to-use data for consumption.

Pipelines and activities

In a Data Factory solution, you create one or more data pipelines. A pipeline is a logical grouping of activities that together perform a task.

Activities define the actions to perform on your data. For example, you may use a Copy activity to copy data from one data store to another data store. Similarly, you may use a Hive activity, which runs a Hive query on an Azure HDInsight cluster to transform or analyze your data. Data Factory supports two types of activities: data movement activities and data transformation activities.
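
To make the pipeline and activity concepts concrete, here is a minimal JSON sketch of a pipeline that contains a single Copy activity. The entity names (CopyBlobToSqlPipeline, AzureBlobInput, AzureSqlOutput) and the start/end window are placeholders for illustration, not values from this article:

```json
{
  "name": "CopyBlobToSqlPipeline",
  "properties": {
    "description": "Copy data from Azure Blob storage to an Azure SQL table",
    "activities": [
      {
        "name": "CopyFromBlobToSql",
        "type": "Copy",
        "inputs": [ { "name": "AzureBlobInput" } ],
        "outputs": [ { "name": "AzureSqlOutput" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlSink" }
        }
      }
    ],
    "start": "2017-01-01T00:00:00Z",
    "end": "2017-01-08T00:00:00Z"
  }
}
```

The start and end properties define the period during which the pipeline is active; the activities run for the data slices that fall in that window.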

Data movement activities

Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the data stores listed in the following table. Data from any source can be written to any sink. See the documentation for each data store to learn how to copy data to and from that store.

| Category  | Data store                   | Supported as a source | Supported as a sink |
| --------- | ---------------------------- | --------------------- | ------------------- |
| Azure     | Azure Blob storage           | ✓                     | ✓                   |
| Azure     | Azure Data Lake Store        | ✓                     | ✓                   |
| Azure     | Azure SQL Database           | ✓                     | ✓                   |
| Azure     | Azure SQL Data Warehouse     | ✓                     | ✓                   |
| Azure     | Azure Table storage          | ✓                     | ✓                   |
| Azure     | Azure DocumentDB             | ✓                     | ✓                   |
| Azure     | Azure Search Index           |                       | ✓                   |
| Databases | SQL Server*                  | ✓                     | ✓                   |
| Databases | Oracle*                      | ✓                     | ✓                   |
| Databases | MySQL*                       | ✓                     |                     |
| Databases | DB2*                         | ✓                     |                     |
| Databases | Teradata*                    | ✓                     |                     |
| Databases | PostgreSQL*                  | ✓                     |                     |
| Databases | Sybase*                      | ✓                     |                     |
| Databases | Cassandra*                   | ✓                     |                     |
| Databases | MongoDB*                     | ✓                     |                     |
| Databases | Amazon Redshift              | ✓                     |                     |
| File      | File System*                 | ✓                     | ✓                   |
| File      | HDFS*                        | ✓                     |                     |
| File      | Amazon S3                    | ✓                     |                     |
| File      | FTP                          | ✓                     |                     |
| Others    | Salesforce                   | ✓                     |                     |
| Others    | Generic ODBC*                | ✓                     |                     |
| Others    | Generic OData                | ✓                     |                     |
| Others    | Web Table (table from HTML)  | ✓                     |                     |
| Others    | GE Historian*                | ✓                     |                     |
Note

Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway on an on-premises/Azure IaaS machine.

For more information, see the Data Movement Activities article.
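
As a sketch of how a starred (on-premises) data store is referenced, the following linked service definition points Copy Activity at an on-premises SQL Server database through a Data Management Gateway. The connection string values and the gateway name are placeholders to fill in with your own settings:

```json
{
  "name": "OnPremisesSqlLinkedService",
  "properties": {
    "type": "OnPremisesSqlServer",
    "typeProperties": {
      "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;",
      "gatewayName": "<gateway name>"
    }
  }
}
```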

Data transformation activities

Azure Data Factory supports the following transformation activities that can be added to pipelines either individually or chained with another activity.

| Data transformation activity                                      | Compute environment                                |
| ----------------------------------------------------------------- | -------------------------------------------------- |
| Hive                                                               | HDInsight [Hadoop]                                 |
| Pig                                                                | HDInsight [Hadoop]                                 |
| MapReduce                                                          | HDInsight [Hadoop]                                 |
| Hadoop Streaming                                                   | HDInsight [Hadoop]                                 |
| Machine Learning activities: Batch Execution and Update Resource  | Azure VM                                           |
| Stored Procedure                                                   | Azure SQL, Azure SQL Data Warehouse, or SQL Server |
| Data Lake Analytics U-SQL                                          | Azure Data Lake Analytics                          |
| DotNet                                                             | HDInsight [Hadoop] or Azure Batch                  |
Note

You can use the MapReduce activity to run Spark programs on your HDInsight Spark cluster. See Invoke Spark programs from Azure Data Factory for details. You can create a custom activity to run R scripts on an HDInsight cluster with R installed. See Run R Script using Azure Data Factory.

For more information, see the Data Transformation Activities article.
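
For illustration, a transformation activity is defined inside a pipeline's activities array much like the Copy activity sketched earlier. The following Hive activity sketch uses placeholder script, linked service, and dataset names:

```json
{
  "name": "RunSampleHiveScript",
  "type": "HDInsightHive",
  "linkedServiceName": "HDInsightLinkedService",
  "inputs": [ { "name": "HiveInput" } ],
  "outputs": [ { "name": "HiveOutput" } ],
  "typeProperties": {
    "scriptPath": "adfscripts/samplehive.hql",
    "scriptLinkedService": "AzureStorageLinkedService"
  }
}
```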

If you need to move data to/from a data store that Copy Activity doesn't support, or transform data using your own logic, create a custom .NET activity. For details on creating and using a custom activity, see Use custom activities in an Azure Data Factory pipeline.
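
A custom .NET activity is declared in a pipeline by pointing Data Factory at the assembly that contains your logic. Below is a minimal sketch, assuming the zipped assembly has been uploaded to a blob container; the assembly name, entry point, package path, and linked service names are illustrative placeholders:

```json
{
  "name": "MyCustomActivity",
  "type": "DotNetActivity",
  "linkedServiceName": "AzureBatchLinkedService",
  "inputs": [ { "name": "InputDataset" } ],
  "outputs": [ { "name": "OutputDataset" } ],
  "typeProperties": {
    "assemblyName": "MyDotNetActivity.dll",
    "entryPoint": "MyNamespace.MyDotNetActivity",
    "packageLinkedService": "AzureStorageLinkedService",
    "packageFile": "customactivitycontainer/MyDotNetActivity.zip"
  }
}
```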

Linked services

Linked services define the information needed for Data Factory to connect to external resources (examples: Azure Storage, an on-premises SQL Server database, Azure HDInsight). Linked services are used for two purposes in Data Factory (a JSON sketch of each follows this list):

  • To represent a data store including, but not limited to, an on-premises SQL Server, Oracle database, file share, or an Azure Blob Storage account. See the Data movement activities section for a list of supported data stores.
  • To represent a compute resource that can host the execution of an activity. For example, the HDInsightHive activity runs on an HDInsight Hadoop cluster. See the Data transformation activities section for a list of supported compute environments.
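
The two purposes correspond to two kinds of linked service definitions. As a minimal sketch (account names, credentials, and the cluster URI are placeholders), the first linked service below represents a data store (an Azure Storage account) and the second represents a compute resource (an existing HDInsight cluster):

```json
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```

```json
{
  "name": "HDInsightLinkedService",
  "properties": {
    "type": "HDInsight",
    "typeProperties": {
      "clusterUri": "https://<clustername>.azurehdinsight.net",
      "userName": "<username>",
      "password": "<password>",
      "linkedServiceName": "AzureStorageLinkedService"
    }
  }
}
```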

Datasets

Linked services link data stores to an Azure data factory. Datasets represent data structures within the data stores. For example, an Azure Storage linked service provides connection information for Data Factory to connect to an Azure Storage account. An Azure Blob dataset specifies the blob container and folder in Azure Blob storage from which the pipeline should read data. Similarly, an Azure SQL linked service provides connection information for an Azure SQL database, and an Azure SQL dataset specifies the table that contains the data.
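
Continuing the Azure Storage example, here is a minimal Azure Blob dataset sketch; the container/folder path, delimiter, and availability window are placeholders:

```json
{
  "name": "AzureBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "adfsample/input/",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
```

The availability section tells Data Factory how often a slice of this dataset is produced, and external marks data that is generated outside the data factory rather than by one of its pipelines.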

Relationship between Data Factory entities

Data Factory has a few key entities that work together to define input and output data, processing events, and the schedule and resources required to execute the desired data flow.

Diagram: Data Factory, a cloud data integration service - Key Concepts

Figure 2. Relationships between Dataset, Activity, Pipeline, and Linked service

With the four simple concepts of linked services, datasets, activities, and pipelines, you are ready to get started! You can build your first pipeline.

Supported regions

Currently, you can create data factories in the West US, East US, and North Europe regions. However, a data factory can access data stores and compute services in other Azure regions to move data between data stores or process data using compute services.

Azure Data Factory itself does not store any data. It lets you create data-driven flows to orchestrate movement of data between supported data stores and processing of data using compute services in other regions or in an on-premises environment. It also allows you to monitor and manage workflows using both programmatic and UI mechanisms.

Even though Data Factory is available only in the West US, East US, and North Europe regions, the service that powers data movement in Data Factory is available globally in several regions. If a data store sits behind a firewall, a Data Management Gateway installed in your on-premises environment moves the data instead.

For example, assume that your compute environments, such as an Azure HDInsight cluster and Azure Machine Learning, are located in the West Europe region. You can create an Azure Data Factory instance in North Europe and use it to schedule jobs on your compute environments in West Europe. It takes a few milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the job on your compute environment does not change.

We intend to have Azure Data Factory in every geography supported by Azure in the future.

Next steps

To learn how to build data factories with data pipelines, follow step-by-step instructions in the following tutorials:

| Tutorial | Description |
| -------- | ----------- |
| Move data between two cloud data stores | In this tutorial, you create a data factory with a pipeline that moves data from Blob storage to a SQL database. |
| Transform data using Hadoop cluster | In this tutorial, you build your first Azure data factory with a data pipeline that processes data by running a Hive script on an Azure HDInsight (Hadoop) cluster. |
| Move data between an on-premises data store and a cloud data store using Data Management Gateway | In this tutorial, you build a data factory with a pipeline that moves data from an on-premises SQL Server database to an Azure blob. As part of the walkthrough, you install and configure the Data Management Gateway on your machine. |