Azure Data Factory FAQ

APPLIES TO: Azure Data Factory Azure Synapse Analytics

This article provides answers to frequently asked questions about Azure Data Factory.

What is Azure Data Factory?

Data Factory is a fully managed, cloud-based, data-integration ETL service that automates the movement and transformation of data. Like a factory that runs equipment to transform raw materials into finished goods, Azure Data Factory orchestrates existing services that collect raw data and transform it into ready-to-use information.

By using Azure Data Factory, you can create data-driven workflows to move data between on-premises and cloud data stores. And you can process and transform data with Data Flows. ADF also supports external compute engines for hand-coded transformations by using compute services such as Azure HDInsight, Azure Databricks, and the SQL Server Integration Services (SSIS) integration runtime.

With Data Factory, you can execute your data processing either on an Azure-based cloud service or in your own self-hosted compute environment, such as SSIS, SQL Server, or Oracle. After you create a pipeline that performs the action you need, you can schedule it to run periodically (hourly, daily, or weekly, for example), time window scheduling, or trigger the pipeline from an event occurrence. For more information, see Introduction to Azure Data Factory.

Compliance and Security Considerations

Azure Data Factory is certified for a range of compliance certifications, including SOC 1, 2, 3, HIPAA BAA, and HITRUST. Full and growing list of certifications can be found here. Digital copies for audit reports and compliance certifications can be found in Service Trust Center

Control flows and scale

To support the diverse integration flows and patterns in the modern data warehouse, Data Factory enables flexible data pipeline modeling. This entails full control flow programming paradigms, which include conditional execution, branching in data pipelines, and the ability to explicitly pass parameters within and across these flows. Control flow also encompasses transforming data through activity dispatch to external execution engines and data flow capabilities, including data movement at scale, via the Copy activity.

Data Factory provides freedom to model any flow style that's required for data integration and that can be dispatched on demand or repeatedly on a schedule. A few common flows that this model enables are:

  • Control flows:
    • Activities can be chained together in a sequence within a pipeline.
    • Activities can be branched within a pipeline.
    • Parameters:
      • Parameters can be defined at the pipeline level and arguments can be passed while you invoke the pipeline on demand or from a trigger.
      • Activities can consume the arguments that are passed to the pipeline.
    • Custom state passing:
      • Activity outputs, including state, can be consumed by a subsequent activity in the pipeline.
    • Looping containers:
      • The foreach activity will iterate over a specified collection of activities in a loop.
  • Trigger-based flows:
    • Pipelines can be triggered on demand, by wall-clock time, or in response to driven by event grid topics
  • Delta flows:
    • Parameters can be used to define your high-water mark for delta copy while moving dimension or reference tables from a relational store, either on-premises or in the cloud, to load the data into the lake.

For more information, see Tutorial: Control flows.

Data transformed at scale with code-free pipelines

The new browser-based tooling experience provides code-free pipeline authoring and deployment with a modern, interactive web-based experience.

For visual data developers and data engineers, the Data Factory web UI is the code-free design environment that you will use to build pipelines. It's fully integrated with Visual Studio Online Git and provides integration for CI/CD and iterative development with debugging options.

Rich cross-platform SDKs for advanced users

Data Factory V2 provides a rich set of SDKs that can be used to author, manage, and monitor pipelines by using your favorite IDE, including:

  • Python SDK
  • PowerShell CLI
  • C# SDK

Users can also use the documented REST APIs to interface with Data Factory V2.

Iterative development and debugging by using visual tools

Azure Data Factory visual tools enable iterative development and debugging. You can create your pipelines and do test runs by using the Debug capability in the pipeline canvas without writing a single line of code. You can view the results of your test runs in the Output window of your pipeline canvas. After your test run succeeds, you can add more activities to your pipeline and continue debugging in an iterative manner. You can also cancel your test runs after they are in progress.

You are not required to publish your changes to the data factory service before selecting Debug. This is helpful in scenarios where you want to make sure that the new additions or changes will work as expected before you update your data factory workflows in development, test, or production environments.

Ability to deploy SSIS packages to Azure

If you want to move your SSIS workloads, you can create a Data Factory and provision an Azure-SSIS integration runtime. An Azure-SSIS integration runtime is a fully managed cluster of Azure VMs (nodes) that are dedicated to run your SSIS packages in the cloud. For step-by-step instructions, see the Deploy SSIS packages to Azure tutorial.

SDKs

If you are an advanced user and looking for a programmatic interface, Data Factory provides a rich set of SDKs that you can use to author, manage, or monitor pipelines by using your favorite IDE. Language support includes .NET, PowerShell, Python, and REST.

Monitoring

You can monitor your Data Factories via PowerShell, SDK, or the Visual Monitoring Tools in the browser user interface. You can monitor and manage on-demand, trigger-based, and clock-driven custom flows in an efficient and effective manner. Cancel existing tasks, see failures at a glance, drill down to get detailed error messages, and debug the issues, all from a single pane of glass without context switching or navigating back and forth between screens.

New features for SSIS in Data Factory

Since the initial public preview release in 2017, Data Factory has added the following features for SSIS:

  • Support for three more configurations/variants of Azure SQL Database to host the SSIS database (SSISDB) of projects/packages:
  • SQL Database with virtual network service endpoints
  • SQL Managed Instance
  • Elastic pool
  • Support for an Azure Resource Manager virtual network on top of a classic virtual network to be deprecated in the future, which lets you inject/join your Azure-SSIS integration runtime to a virtual network configured for SQL Database with virtual network service endpoints/MI/on-premises data access. For more information, see also Join an Azure-SSIS integration runtime to a virtual network.
  • Support for Azure Active Directory (Azure AD) authentication and SQL authentication to connect to the SSISDB, allowing Azure AD authentication with your Data Factory managed identity for Azure resources
  • Support for bringing your existing SQL Server license to earn substantial cost savings from the Azure Hybrid Benefit option
  • Support for Enterprise Edition of the Azure-SSIS integration runtime that lets you use advanced/premium features, a custom setup interface to install additional components/extensions, and a partner ecosystem. For more information, see also Enterprise Edition, Custom Setup, and 3rd Party Extensibility for SSIS in ADF.
  • Deeper integration of SSIS in Data Factory that lets you invoke/trigger first-class Execute SSIS Package activities in Data Factory pipelines and schedule them via SSMS. For more information, see also Modernize and extend your ETL/ELT workflows with SSIS activities in ADF pipelines.

What is the integration runtime?

The integration runtime is the compute infrastructure that Azure Data Factory uses to provide the following data integration capabilities across various network environments:

  • Data movement: For data movement, the integration runtime moves the data between the source and destination data stores, while providing support for built-in connectors, format conversion, column mapping, and performant and scalable data transfer.
  • Data flow: For data flow, Execute a Data Flow in managed Azure compute environment.
  • Dispatch activities: For transformation, the integration runtime provides capability to natively execute SSIS packages.
  • Execute SSIS packages: The integration runtime natively executes SSIS packages in a managed Azure compute environment. The integration runtime also supports dispatching and monitoring transformation activities running on a variety of compute services, such as Azure HDInsight, Azure Machine Learning, SQL Database, and SQL Server.

You can deploy one or many instances of the integration runtime as required to move and transform data. The integration runtime can run on an Azure public network or on a private network (on-premises, Azure Virtual Network, or Amazon Web Services virtual private cloud [VPC]). In Data Factory, an activity defines the action to be performed. A linked service defines a target data store or a compute service. An integration runtime provides the bridge between the activity and linked Services. It's referenced by the linked service or activity, and provides the compute environment where the activity either runs on or gets dispatched from. This way, the activity can be performed in the region closest possible to the target data store or compute service in the most performant way while meeting security and compliance needs.

Integration runtimes can be created in the Azure Data Factory UX via the management hub and any activities, datasets, or data flows that reference them. For more information, see Integration runtime in Azure Data Factory.

What is the limit on the number of integration runtimes?

There is no hard limit on the number of integration runtime instances you can have in a data factory. There is, however, a limit on the number of VM cores that the integration runtime can use per subscription for SSIS package execution. For more information, see Data Factory limits.

What are the top-level concepts of Azure Data Factory?

An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory contains four key components that work together as a platform on which you can compose data-driven workflows with steps to move and transform data.

Pipelines

A data factory can have one or more pipelines. A pipeline is a logical grouping of activities to perform a unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities that ingest data from an Azure blob and then run a Hive query on an HDInsight cluster to partition the data. The benefit is that you can use a pipeline to manage the activities as a set instead of having to manage each activity individually. You can chain together the activities in a pipeline to operate them sequentially, or you can operate them independently, in parallel.

Data flows

Data flows are objects that you build visually in Data Factory which transform data at scale on backend Spark services. You do not need to understand programming or Spark internals. Just design your data transformation intent using graphs (Mapping) or spreadsheets (Power query activity).

Activities

Activities represent a processing step in a pipeline. For example, you can use a Copy activity to copy data from one data store to another data store. Similarly, you can use a Hive activity, which runs a Hive query on an Azure HDInsight cluster to transform or analyze your data. Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.

Datasets

Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.

Linked services

Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources. Think of it this way: A linked service defines the connection to the data source, and a dataset represents the structure of the data. For example, an Azure Storage linked service specifies the connection string to connect to the Azure Storage account. And an Azure blob dataset specifies the blob container and the folder that contains the data.

Linked services have two purposes in Data Factory:

  • To represent a data store that includes, but is not limited to, a SQL Server instance, an Oracle database instance, a file share, or an Azure Blob storage account. For a list of supported data stores, see Copy Activity in Azure Data Factory.
  • To represent a compute resource that can host the execution of an activity. For example, the HDInsight Hive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and supported compute environments, see Transform data in Azure Data Factory.

Triggers

Triggers represent units of processing that determine when a pipeline execution is kicked off. There are different types of triggers for different types of events.

Pipeline runs

A pipeline run is an instance of a pipeline execution. You usually instantiate a pipeline run by passing arguments to the parameters that are defined in the pipeline. You can pass the arguments manually or within the trigger definition.

Parameters

Parameters are key-value pairs in a read-only configuration. You define parameters in a pipeline, and you pass the arguments for the defined parameters during execution from a run context. The run context is created by a trigger or from a pipeline that you execute manually. Activities within the pipeline consume the parameter values.

A dataset is a strongly typed parameter and an entity that you can reuse or reference. An activity can reference datasets, and it can consume the properties that are defined in the dataset definition.

A linked service is also a strongly typed parameter that contains connection information to either a data store or a compute environment. It's also an entity that you can reuse or reference.

Control flows

Control flows orchestrate pipeline activities that include chaining activities in a sequence, branching, parameters that you define at the pipeline level, and arguments that you pass as you invoke the pipeline on demand or from a trigger. Control flows also include custom state passing and looping containers (that is, foreach iterators).

For more information about Data Factory concepts, see the following articles:

What is the pricing model for Data Factory?

For Azure Data Factory pricing details, see Data Factory pricing details.

How can I stay up-to-date with information about Data Factory?

For the most up-to-date information about Azure Data Factory, go to the following sites:

Technical deep dive

How can I schedule a pipeline?

You can use the scheduler trigger or time window trigger to schedule a pipeline. The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM). For more information, see Pipeline execution and triggers.

Can I pass parameters to a pipeline run?

Yes, parameters are a first-class, top-level concept in Data Factory. You can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or by using a trigger.

Can I define default values for the pipeline parameters?

Yes. You can define default values for the parameters in the pipelines.

Can an activity in a pipeline consume arguments that are passed to a pipeline run?

Yes. Each activity within the pipeline can consume the parameter value that's passed to the pipeline and run with the @parameter construct.

Can an activity output property be consumed in another activity?

Yes. An activity output can be consumed in a subsequent activity with the @activity construct.

How do I gracefully handle null values in an activity output?

You can use the @coalesce construct in the expressions to handle null values gracefully.

Mapping data flows

I need help troubleshooting my data flow logic. What info do I need to provide to get help?

When Microsoft provides help or troubleshooting with data flows, please provide the ADF pipeline support files. This Zip file contains the code-behind script from your data flow graph. From the ADF UI, click ... next to pipeline, and then click Download support files.

How do I access data by using the other 90 dataset types in Data Factory?

The mapping data flow feature currently allows Azure SQL Database, Azure Synapse Analytics, delimited text files from Azure Blob storage or Azure Data Lake Storage Gen2, and Parquet files from Blob storage or Data Lake Storage Gen2 natively for source and sink.

Use the Copy activity to stage data from any of the other connectors, and then execute a Data Flow activity to transform data after it's been staged. For example, your pipeline will first copy into Blob storage, and then a Data Flow activity will use a dataset in source to transform that data.

Is the self-hosted integration runtime available for data flows?

Self-hosted IR is an ADF pipeline construct that you can use with the Copy Activity to acquire or move data to and from on-prem or VM-based data sources and sinks. The virtual machines that you use for a self-hosted IR can also be placed inside of the same VNET as your protected data stores for access to those data stores from ADF. With data flows, you'll achieve these same end-results using the Azure IR with managed VNET instead.

Does the data flow compute engine serve multiple tenants?

Clusters are never shared. We guarantee isolation for each job run in production runs. In case of debug scenario one person gets one cluster, and all debugs will go to that cluster which are initiated by that user.

Is there a way to write attributes in cosmos db in the same order as specified in the sink in ADF data flow?

For cosmos DB, the underlying format of each document is a JSON object which is an unordered set of name/value pairs, so the order cannot be reserved.

Why a user is unable to use data preview in the data flows?

You should check permissions for custom role. There are multiple actions involved in the dataflow data preview. You start by checking network traffic while debugging on your browser. Please follow all of the actions, for details, please refer to Resource provider.

In ADF, can I calculate value for a new column from existing column from mapping?

You can use derive transformation in mapping data flow to create a new column on the logic you want. When creating a derived column, you can either generate a new column or update an existing one. In the Column textbox, enter in the column you are creating. To override an existing column in your schema, you can use the column dropdown. To build the derived column's expression, click on the Enter expression textbox. You can either start typing your expression or open up the expression builder to construct your logic.

Why mapping data flow preview failing with Gateway timeout?

Please try to use larger cluster and leverage the row limits in debug settings to a smaller value to reduce the size of debug output.

How to parameterize column name in dataflow?

Column name can be parameterized similar to other properties. Like in derived column customer can use $ColumnNameParam = toString(byName($myColumnNameParamInData)). These parameters can be passed from pipeline execution down to Data flows.

The data flow advisory about TTL and costs

This troubleshoot document may help to resolve your issues: Mapping data flows performance and tuning guide-Time to live.

Power Query Data Wrangling

What are the supported regions for data wrangling?

Data factory is available in following regions. The Power Query feature is available in all data flow regions. If the feature is not available in your region, please check with support.

What is the difference between mapping data flow and Power query actvity (data wrangling)?

Mapping data flows provide a way to transform data at scale without any coding required. You can design a data transformation job in the data flow canvas by constructing a series of transformations. Start with any number of source transformations followed by data transformation steps. Complete your data flow with a sink to land your results in a destination. Mapping data flow is great at mapping and transforming data with both known and unknown schemas in the sinks and sources.

Power Query Data Wrangling allow you to do agile data preparation and exploration using the Power Query Online mashup editor at scale via spark execution. With the rise of data lakes sometimes you just need to explore a data set or create a dataset in the lake. You aren't mapping to a known target.

Supported SQL Types

Power Query Data Wrangling supports the following data types in SQL. You will get a validation error for using a data type that isn't supported.

  • short
  • double
  • real
  • float
  • char
  • nchar
  • varchar
  • nvarchar
  • integer
  • int
  • bit
  • boolean
  • smallint
  • tinyint
  • bigint
  • long
  • text
  • date
  • datetime
  • datetime2
  • smalldatetime
  • timestamp
  • uniqueidentifier
  • xml

Next steps

For step-by-step instructions to create a data factory, see the following tutorials: