Azure Data Factory - Frequently asked questions

Note

This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see Frequently asked questions - Data Factory.

Note

This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell. To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

General questions

What is Azure Data Factory?

Data Factory is a cloud-based data integration service that automates the movement and transformation of data. Just like a factory that runs equipment to take raw materials and transform them into finished goods, Data Factory orchestrates existing services that collect raw data and transform it into ready-to-use information.

Data Factory allows you to create data-driven workflows that move data between on-premises and cloud data stores, and that process/transform data by using compute services such as Azure HDInsight and Azure Data Lake Analytics. After you create a pipeline that performs the action you need, you can schedule it to run periodically (hourly, daily, weekly, and so on).

For more information, see Overview & Key Concepts.

Where can I find pricing details for Azure Data Factory?

See the Data Factory Pricing Details page for Azure Data Factory pricing details.

How do I get started with Azure Data Factory?

What is the Data Factory's region availability?

Data Factory is available in the West US and North Europe regions. The compute and storage services used by data factories can be in other regions. See Supported regions.

What are the limits on the number of data factories/pipelines/activities/datasets?

See Azure Data Factory Limits section of the Azure Subscription and Service Limits, Quotas, and Constraints article.

What is the authoring/developer experience with the Azure Data Factory service?

You can author/create data factories using one of the following tools/SDKs:

Can I rename a data factory?

No. Like other Azure resources, the name of an Azure data factory cannot be changed.

Can I move a data factory from one Azure subscription to another?

Yes. Use the Move button on your data factory blade as shown in the following diagram:

Move data factory
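
If you prefer scripting, you can also move the resource with the generic Move-AzResource cmdlet. The following is a minimal sketch rather than a Data Factory-specific procedure; the resource group names, factory name, destination subscription ID, and the assumption that the version 1 resource type is Microsoft.DataFactory/dataFactories are placeholders you would verify and replace:

# Look up the data factory resource (names shown here are placeholders).
$factory = Get-AzResource -ResourceGroupName "SourceResourceGroup" `
    -ResourceType "Microsoft.DataFactory/dataFactories" -Name "MyDataFactory"

# Move it to a resource group in the destination subscription.
Move-AzResource -ResourceId $factory.ResourceId `
    -DestinationSubscriptionId "<destination-subscription-id>" `
    -DestinationResourceGroupName "DestinationResourceGroup"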

What are the compute environments supported by Data Factory?

How does Azure Data Factory compare with SQL Server Integration Services (SSIS)?

See the Azure Data Factory vs. SSIS presentation from one of our MVPs (Most Valued Professionals): Reza Rad. Some of the recent changes in Data Factory may not be listed in the slide deck. We are continuously adding more capabilities to Azure Data Factory, and we will incorporate these updates into the comparison of data integration technologies from Microsoft later this year.

Activities - FAQ

What are the different types of activities you can use in a Data Factory pipeline?

When does an activity run?

The availability setting on the output dataset determines when the activity runs. If input datasets are specified, the activity checks whether all the input data dependencies are satisfied (that is, whether the inputs are in the Ready state) before it starts running.
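
For example, an availability section like the following on the output dataset (a minimal sketch; the frequency and interval values are illustrative) causes the activity to produce one slice per hour:

"availability":
{
    "frequency": "Hour",
    "interval": 1
}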

Copy Activity - FAQ

Is it better to have a pipeline with multiple activities or a separate pipeline for each activity?

Pipelines are meant to bundle related activities. If the datasets that connect them are not consumed by any other activity outside the pipeline, you can keep the activities in one pipeline. This way, you do not need to chain pipeline active periods so that they align with each other. Also, the data integrity in the datasets internal to the pipeline is better preserved when you update the pipeline. A pipeline update essentially stops all the activities within the pipeline, removes them, and creates them again. From an authoring perspective, it might also be easier to see the flow of data within the related activities in one JSON file for the pipeline.

What are the supported data stores?

Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy data to and from that store.

Category   | Data store                   | Supported as a source | Supported as a sink
Azure      | Azure Blob storage           | ✓                     | ✓
           | Azure Cosmos DB (SQL API)    | ✓                     | ✓
           | Azure Data Lake Storage Gen1 | ✓                     | ✓
           | Azure SQL Database           | ✓                     | ✓
           | Azure Synapse Analytics      | ✓                     | ✓
           | Azure Cognitive Search Index |                       | ✓
           | Azure Table storage          | ✓                     | ✓
Databases  | Amazon Redshift              | ✓                     |
           | DB2*                         | ✓                     |
           | MySQL*                       | ✓                     |
           | Oracle*                      | ✓                     | ✓
           | PostgreSQL*                  | ✓                     |
           | SAP Business Warehouse*      | ✓                     |
           | SAP HANA*                    | ✓                     |
           | SQL Server*                  | ✓                     | ✓
           | Sybase*                      | ✓                     |
           | Teradata*                    | ✓                     |
NoSQL      | Cassandra*                   | ✓                     |
           | MongoDB*                     | ✓                     |
File       | Amazon S3                    | ✓                     |
           | File System*                 | ✓                     | ✓
           | FTP                          | ✓                     |
           | HDFS*                        | ✓                     |
           | SFTP                         | ✓                     |
Others     | Generic HTTP                 | ✓                     |
           | Generic OData                | ✓                     |
           | Generic ODBC*                | ✓                     |
           | Salesforce                   | ✓                     |
           | Web Table (table from HTML)  | ✓                     |

Note

Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway on an on-premises/Azure IaaS machine.

What are the supported file formats?

Azure Data Factory supports the following file format types:

Where is the copy operation performed?

See the Globally available data movement section for details. In short, when an on-premises data store is involved, the copy operation is performed by the Data Management Gateway in your on-premises environment. When the data movement is between two cloud stores, the copy operation is performed in the region closest to the sink location within the same geography.

HDInsight Activity - FAQ

What regions are supported by HDInsight?

See the Geographic Availability section of the HDInsight Pricing Details page.

What region is used by an on-demand HDInsight cluster?

The on-demand HDInsight cluster is created in the same region where the storage you specified to be used with the cluster exists.

How to associate additional storage accounts with your HDInsight cluster?

If you are using your own HDInsight Cluster (BYOC - Bring Your Own Cluster), see the following topics:

If you are using an on-demand cluster that is created by the Data Factory service, specify additional storage accounts for the HDInsight linked service so that the Data Factory service can register them on your behalf. In the JSON definition for the on-demand linked service, use the additionalLinkedServiceNames property to specify alternate storage accounts, as shown in the following JSON snippet:

{
    "name": "MyHDInsightOnDemandLinkedService",
    "properties":
    {
        "type": "HDInsightOnDemandLinkedService",
        "typeProperties": {
            "version": "3.5",
            "clusterSize": 1,
            "timeToLive": "00:05:00",
            "osType": "Linux",
            "linkedServiceName": "LinkedService-SampleData",
            "additionalLinkedServiceNames": [ "otherLinkedServiceName1", "otherLinkedServiceName2" ]
        }
    }
}

In the example above, otherLinkedServiceName1 and otherLinkedServiceName2 represent linked services whose definitions contain credentials that the HDInsight cluster needs to access alternate storage accounts.
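
Each name in additionalLinkedServiceNames must refer to a storage linked service that is already defined in the data factory. As a rough sketch (the storage account name and key are placeholders, and the exact type name depends on the JSON schema version you target), such a linked service looks like the following:

{
    "name": "otherLinkedServiceName1",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}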

Slices - FAQ

Why are my input slices not in Ready state?

A common mistake is not setting the external property to true on the input dataset when the input data is external to the data factory (that is, not produced by the data factory).

In the following example, you only need to set external to true on dataset1.

DataFactory1
Pipeline 1: dataset1 -> activity1 -> dataset2 -> activity2 -> dataset3
Pipeline 2: dataset3 -> activity3 -> dataset4

If you have another data factory with a pipeline that takes dataset4 (produced by pipeline 2 in data factory 1), mark dataset4 as an external dataset because the dataset is produced by a different data factory (DataFactory1, not DataFactory2).

DataFactory2
Pipeline 1: dataset4 -> activity4 -> dataset5

If the external property is properly set, verify whether the input data exists in the location specified in the input dataset definition.
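
As a rough sketch (the dataset type, linked service name, folder path, and retry policy values shown here are illustrative), marking dataset1 as external looks like this:

{
    "name": "dataset1",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/input/"
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "policy": {
            "externalData": {
                "retryInterval": "00:01:00",
                "retryTimeout": "00:10:00",
                "maximumRetry": 3
            }
        }
    }
}

The optional externalData policy controls how Data Factory retries when the external data has not yet arrived.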

How to run a slice at a time other than midnight when the slice is produced daily?

Use the offset property to specify the time at which you want the slice to be produced. See Dataset availability section for details about this property. Here is a quick example:

"availability":
{
    "frequency": "Day",
    "interval": 1,
    "offset": "06:00:00"
}

Daily slices start at 6 AM instead of the default midnight.

How can I rerun a slice?

You can rerun a slice in one of the following ways:

  • Use Monitor and Manage App to rerun an activity window or slice. See Rerun selected activity windows for instructions.

  • Click Run in the command bar on the DATA SLICE blade for the slice in the Azure portal.

  • Run the Set-AzDataFactorySliceStatus cmdlet with Status set to Waiting for the slice.

    Set-AzDataFactorySliceStatus -Status Waiting -ResourceGroupName $ResourceGroup -DataFactoryName $df -TableName $table -StartDateTime "02/26/2015 19:00:00" -EndDateTime "02/26/2015 20:00:00"
    

    See Set-AzDataFactorySliceStatus for details about the cmdlet.

How long did it take to process a slice?

Use the Activity Window Explorer in the Monitor & Manage App to see how long it took to process a data slice. See Activity Window Explorer for details.

You can also do the following in the Azure portal:

  1. Click the Datasets tile on the DATA FACTORY blade for your data factory.
  2. Click the specific dataset on the Datasets blade.
  3. Select the slice that you are interested in from the Recent slices list on the TABLE blade.
  4. Click the activity run from the Activity Runs list on the DATA SLICE blade.
  5. Click the Properties tile on the ACTIVITY RUN DETAILS blade.
  6. You should see the DURATION field with a value. This value is the time taken to process the slice.
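
You can also retrieve the same information with PowerShell. The following is a minimal sketch using the Get-AzDataFactoryRun cmdlet (the dataset name and slice start time are placeholders; $ResourceGroup and $df are the same variables used in the earlier Set-AzDataFactorySliceStatus example); the returned activity runs report processing start and end times, from which you can compute the duration:

# Placeholders: replace with your resource group, data factory, dataset, and slice start time.
Get-AzDataFactoryRun -ResourceGroupName $ResourceGroup -DataFactoryName $df `
    -DatasetName "MyDataset" -StartDateTime "02/26/2015 19:00:00"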

How to stop a running slice?

If you need to stop the pipeline from executing, you can use the Suspend-AzDataFactoryPipeline cmdlet. Currently, suspending the pipeline does not stop the slice executions that are in progress. After the in-progress executions finish, no extra slices are picked up.

If you really want to stop all the executions immediately, the only way would be to delete the pipeline and create it again. If you choose to delete the pipeline, you do NOT need to delete tables and linked services used by the pipeline.
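
As a minimal sketch (the pipeline name is a placeholder; $ResourceGroup and $df are the same variables used in the earlier examples), suspending and later resuming a pipeline looks like this:

# Suspend the pipeline; slice executions that are already in progress still run to completion.
Suspend-AzDataFactoryPipeline -ResourceGroupName $ResourceGroup -DataFactoryName $df -Name "MyPipeline"

# Resume the pipeline when you want slices to be picked up again.
Resume-AzDataFactoryPipeline -ResourceGroupName $ResourceGroup -DataFactoryName $df -Name "MyPipeline"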