Azure Data Factory - Frequently Asked Questions

General questions

What is Azure Data Factory?

Data Factory is a cloud-based data integration service that automates the movement and transformation of data. Just like a factory that runs equipment to take raw materials and transform them into finished goods, Data Factory orchestrates existing services that collect raw data and transform it into ready-to-use information.

Data Factory allows you to create data-driven workflows to move data between both on-premises and cloud data stores as well as process/transform data using compute services such as Azure HDInsight and Azure Data Lake Analytics. After you create a pipeline that performs the action that you need, you can schedule it to run periodically (hourly, daily, weekly etc.).

For more information, see Overview & Key Concepts.

Where can I find pricing details for Azure Data Factory?

See the Data Factory Pricing Details page for Azure Data Factory pricing.

How do I get started with Azure Data Factory?

What is the Data Factory’s region availability?

Data Factory is available in the West US and North Europe regions. The compute and storage services used by data factories can be in other regions. See Supported regions.

What are the limits on number of data factories/pipelines/activities/datasets?

See Azure Data Factory Limits section of the Azure Subscription and Service Limits, Quotas, and Constraints article.

What is the authoring/developer experience with Azure Data Factory service?

You can author/create data factories using one of the following tools/SDKs:

Can I rename a data factory?

No. Like other Azure resources, the name of an Azure data factory cannot be changed.

Can I move a data factory from one Azure subscription to another?

Yes. Use the Move button on your data factory blade as shown in the following diagram:

Move data factory

What are the compute environments supported by Data Factory?

The following table provides a list of compute environments supported by Data Factory and the activities that can run on them.

Compute environment | Activities
On-demand HDInsight cluster or your own HDInsight cluster | DotNet, Hive, Pig, MapReduce, Hadoop Streaming
Azure Batch | DotNet
Azure Machine Learning | Machine Learning activities: Batch Execution and Update Resource
Azure Data Lake Analytics | Data Lake Analytics U-SQL
Azure SQL, Azure SQL Data Warehouse, SQL Server | Stored Procedure
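
For illustration, an activity that runs on one of these compute environments references the corresponding compute linked service from the activity definition. The following HDInsightHive activity is a minimal sketch; the activity, script, linked service, and dataset names are illustrative assumptions, not from this article:

{
    "name": "RunSampleHiveScript",
    "type": "HDInsightHive",
    "linkedServiceName": "MyHDInsightLinkedService",
    "typeProperties": {
        "scriptPath": "adfscripts/samplescript.hql",
        "scriptLinkedService": "MyStorageLinkedService"
    },
    "inputs": [ { "name": "InputDataset" } ],
    "outputs": [ { "name": "OutputDataset" } ]
}

Here, linkedServiceName points at the HDInsight compute linked service, while scriptLinkedService points at the storage account that holds the Hive script.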

How does Azure Data Factory compare with SQL Server Integration Services (SSIS)?

See the Azure Data Factory vs. SSIS presentation from one of our MVPs (Most Valued Professionals): Reza Rad. Some of the recent changes in Data Factory may not be listed in the slide deck. We are continuously adding more capabilities to Azure Data Factory, and we will incorporate these updates into the comparison of data integration technologies from Microsoft later this year.

Activities - FAQ

What are the different types of activities you can use in a Data Factory pipeline?

When does an activity run?

The availability configuration setting in the output dataset determines when the activity is run. If input datasets are specified, the activity checks whether all the input data dependencies are satisfied (that is, in the Ready state) before it starts running.
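
For example, an output dataset with the following availability section (a minimal sketch; the values are illustrative) causes the activity that produces it to run once per hour:

"availability":
{
    "frequency": "Hour",
    "interval": 1
}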

Copy Activity - FAQ

Is it better to have a pipeline with multiple activities or a separate pipeline for each activity?

Pipelines are supposed to bundle related activities. If the datasets that connect them are not consumed by any other activity outside the pipeline, you can keep the activities in one pipeline. This way, you would not need to chain pipeline active periods so that they align with each other. Also, the data integrity in the tables internal to the pipeline is better preserved when updating the pipeline. A pipeline update essentially stops all the activities within the pipeline, removes them, and creates them again. From an authoring perspective, it might also be easier to see the flow of data within the related activities in one JSON file for the pipeline.
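
As a rough sketch of this idea, the following pipeline keeps two related copy activities together, chained through the intermediate dataset2. All names, source/sink types, and dates are illustrative assumptions, not taken from this article:

{
    "name": "SamplePipelineWithRelatedActivities",
    "properties": {
        "description": "Two related activities chained through an intermediate dataset",
        "activities": [
            {
                "name": "activity1",
                "type": "Copy",
                "inputs": [ { "name": "dataset1" } ],
                "outputs": [ { "name": "dataset2" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                }
            },
            {
                "name": "activity2",
                "type": "Copy",
                "inputs": [ { "name": "dataset2" } ],
                "outputs": [ { "name": "dataset3" } ],
                "typeProperties": {
                    "source": { "type": "SqlSource" },
                    "sink": { "type": "BlobSink" }
                }
            }
        ],
        "start": "2017-01-01T00:00:00Z",
        "end": "2017-01-02T00:00:00Z"
    }
}

Because dataset2 is consumed only inside this pipeline, both activities can live in the same pipeline definition, and the flow of data is visible in a single JSON file.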

What are the supported data stores?

Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy data to and from that store.

Category | Data store
Azure | Azure Blob storage
Azure | Azure Cosmos DB (DocumentDB API)
Azure | Azure Data Lake Store
Azure | Azure SQL Database
Azure | Azure SQL Data Warehouse
Azure | Azure Search Index
Azure | Azure Table storage
Databases | Amazon Redshift
Databases | DB2*
Databases | MySQL*
Databases | Oracle*
Databases | PostgreSQL*
Databases | SAP Business Warehouse*
Databases | SAP HANA*
Databases | SQL Server*
Databases | Sybase*
Databases | Teradata*
NoSQL | Cassandra*
NoSQL | MongoDB*
File | Amazon S3
File | File System*
File | FTP
File | HDFS*
File | SFTP
Others | Generic HTTP
Others | Generic OData
Others | Generic ODBC*
Others | Salesforce
Others | Web Table (table from HTML)
Others | GE Historian*

Note

Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway on an on-premises/Azure IaaS machine.

What are the supported file formats?

Specifying formats

Azure Data Factory supports the following format types:

Specifying TextFormat

If you want to parse the text files or write the data in text format, set the format type property to TextFormat. You can also specify the following optional properties in the format section. See the TextFormat example section for how to configure these properties.

columnDelimiter
Description: The character used to separate columns in a file. Consider using a rare unprintable character that is unlikely to appear in your data; for example, specify "\u0001", which represents Start of Heading (SOH).
Allowed values: Only one character is allowed. The default value is comma (','). To use a Unicode character, refer to Unicode Characters to get the corresponding code for it.
Required: No

rowDelimiter
Description: The character used to separate rows in a file.
Allowed values: Only one character is allowed. The default value is any of the following values on read: ["\r\n", "\r", "\n"], and "\r\n" on write.
Required: No

escapeChar
Description: The special character used to escape a column delimiter in the content of the input file. You cannot specify both escapeChar and quoteChar for a table.
Allowed values: Only one character is allowed. No default value.
Example: If you have comma (',') as the column delimiter but you want to have the comma character in the text (example: "Hello, world"), you can define '$' as the escape character and use the string "Hello$, world" in the source.
Required: No

quoteChar
Description: The character used to quote a string value. The column and row delimiters inside the quote characters are treated as part of the string value. This property is applicable to both input and output datasets. You cannot specify both escapeChar and quoteChar for a table.
Allowed values: Only one character is allowed. No default value.
Example: If you have comma (',') as the column delimiter but you want to have the comma character in the text (example: <Hello, world>), you can define " (double quote) as the quote character and use the string "Hello, world" in the source.
Required: No

nullValue
Description: One or more characters used to represent a null value.
Allowed values: One or more characters. The default values are "\N" and "NULL" on read, and "\N" on write.
Required: No

encodingName
Description: Specify the encoding name.
Allowed values: A valid encoding name; see the Encoding.EncodingName property. Examples: windows-1250 or shift_jis. The default value is UTF-8.
Required: No

firstRowAsHeader
Description: Specifies whether to consider the first row as a header. For an input dataset, Data Factory reads the first row as a header. For an output dataset, Data Factory writes the first row as a header. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios.
Allowed values: True, False (default)
Required: No

skipLineCount
Description: Indicates the number of rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios.
Allowed values: Integer
Required: No

treatEmptyAsNull
Description: Specifies whether to treat a null or empty string as a null value when reading data from an input file.
Allowed values: True (default), False
Required: No

TextFormat example

The following sample shows some of the format properties for TextFormat.

"typeProperties":
{
    "folderPath": "mycontainer/myfolder",
    "fileName": "myblobname",
    "format":
    {
        "type": "TextFormat",
        "columnDelimiter": ",",
        "rowDelimiter": ";",
        "quoteChar": "\"",
        "NullValue": "NaN",
        "firstRowAsHeader": true,
        "skipLineCount": 0,
        "treatEmptyAsNull": true
    }
},

To use an escapeChar instead of a quoteChar, replace the quoteChar line with the following escapeChar line:

"escapeChar": "$",

Scenarios for using firstRowAsHeader and skipLineCount

  • You are copying from a non-file source to a text file and would like to add a header line containing the schema metadata (for example: SQL schema). Specify firstRowAsHeader as true in the output dataset for this scenario.
  • You are copying from a text file containing a header line to a non-file sink and would like to drop that line. Specify firstRowAsHeader as true in the input dataset.
  • You are copying from a text file and want to skip a few lines at the beginning that contain no data or header information. Specify skipLineCount to indicate the number of lines to be skipped. If the rest of the file contains a header line, you can also specify firstRowAsHeader. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file, as shown in the sketch after this list.
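
For example, an input dataset that skips the first two non-data lines of each file and then reads a header line could use a format section like the following sketch (the delimiter and counts are illustrative):

"format":
{
    "type": "TextFormat",
    "columnDelimiter": ",",
    "skipLineCount": 2,
    "firstRowAsHeader": true
}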

Specifying JsonFormat

To import/export JSON files as-is into/from Azure Cosmos DB, see the Import/export JSON documents section of the Azure Cosmos DB connector article for details.

If you want to parse the JSON files or write the data in JSON format, set the format type property to JsonFormat. You can also specify the following optional properties in the format section. See the JsonFormat example section for how to configure these properties.

filePattern
Description: Indicates the pattern of data stored in each JSON file. Allowed values are setOfObjects and arrayOfObjects. The default value is setOfObjects. See the JSON file patterns section for details about these patterns.
Required: No

jsonNodeReference
Description: If you want to iterate and extract data from the objects inside an array field with the same pattern, specify the JSON path of that array. This property is supported only when copying data from JSON files.
Required: No

jsonPathDefinition
Description: Specify the JSON path expression for each column mapping with a customized column name (start with lowercase). This property is supported only when copying data from JSON files, and you can extract data from an object or an array. For fields under the root object, start with root $; for fields inside the array chosen by the jsonNodeReference property, start from the array element. See the JsonFormat example section for how to configure this property.
Required: No

encodingName
Description: Specify the encoding name. For the list of valid encoding names, see the Encoding.EncodingName property. For example: windows-1250 or shift_jis. The default value is UTF-8.
Required: No

nestingSeparator
Description: Character that is used to separate nesting levels. The default value is '.' (dot).
Required: No

JSON file patterns

Copy Activity can parse the following patterns of JSON files:

  • Type I: setOfObjects

    Each file contains a single object, or multiple objects that are line-delimited or concatenated. When this option is chosen in an output dataset, Copy Activity produces a single JSON file with one object per line (line-delimited).

    • single object JSON example

      {
          "time": "2015-04-29T07:12:20.9100000Z",
          "callingimsi": "466920403025604",
          "callingnum1": "678948008",
          "callingnum2": "567834760",
          "switch1": "China",
          "switch2": "Germany"
      }
      
    • line-delimited JSON example

      {"time":"2015-04-29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":"567834760","switch1":"China","switch2":"Germany"}
      {"time":"2015-04-29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":"789037573","switch1":"US","switch2":"UK"}
      {"time":"2015-04-29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":"345626404","switch1":"Germany","switch2":"UK"}
      
    • concatenated JSON example

      {
          "time": "2015-04-29T07:12:20.9100000Z",
          "callingimsi": "466920403025604",
          "callingnum1": "678948008",
          "callingnum2": "567834760",
          "switch1": "China",
          "switch2": "Germany"
      }
      {
          "time": "2015-04-29T07:13:21.0220000Z",
          "callingimsi": "466922202613463",
          "callingnum1": "123436380",
          "callingnum2": "789037573",
          "switch1": "US",
          "switch2": "UK"
      }
      {
          "time": "2015-04-29T07:13:21.4370000Z",
          "callingimsi": "466923101048691",
          "callingnum1": "678901578",
          "callingnum2": "345626404",
          "switch1": "Germany",
          "switch2": "UK"
      }
      
  • Type II: arrayOfObjects

    Each file contains an array of objects.

    [
        {
            "time": "2015-04-29T07:12:20.9100000Z",
            "callingimsi": "466920403025604",
            "callingnum1": "678948008",
            "callingnum2": "567834760",
            "switch1": "China",
            "switch2": "Germany"
        },
        {
            "time": "2015-04-29T07:13:21.0220000Z",
            "callingimsi": "466922202613463",
            "callingnum1": "123436380",
            "callingnum2": "789037573",
            "switch1": "US",
            "switch2": "UK"
        },
        {
            "time": "2015-04-29T07:13:21.4370000Z",
            "callingimsi": "466923101048691",
            "callingnum1": "678901578",
            "callingnum2": "345626404",
            "switch1": "Germany",
            "switch2": "UK"
        }
    ]
    

JsonFormat example

Case 1: Copying data from JSON files

The following two samples show how to copy data from JSON files. Generic points to note are listed after the samples.

Sample 1: extract data from object and array

In this sample, you expect one root JSON object to map to a single record in the tabular result. If you have a JSON file with the following content:

{
    "id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
    "context": {
        "device": {
            "type": "PC"
        },
        "custom": {
            "dimensions": [
                {
                    "TargetResourceType": "Microsoft.Compute/virtualMachines"
                },
                {
                    "ResourceManagmentProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3"
                },
                {
                    "OccurrenceTime": "1/13/2017 11:24:37 AM"
                }
            ]
        }
    }
}

and you want to copy it into an Azure SQL table in the following format, by extracting data from both objects and array:

id | deviceType | targetResourceType | resourceManagmentProcessRunId | occurrenceTime
ed0e4960-d9c5-11e6-85dc-d7996816aad3 | PC | Microsoft.Compute/virtualMachines | 827f8aaa-ab72-437c-ba48-d8917a7336a3 | 1/13/2017 11:24:37 AM

The input dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts shown). More specifically:

  • structure section defines the customized column names and the corresponding data type while converting to tabular data. This section is optional unless you need to do column mapping. See Specifying structure definition for rectangular datasets section for more details.
  • jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from. To copy data from an array, you can use array[x].property to extract the value of the given property from the xth object, or you can use array[*].property to find the value from any object that contains such a property (a variation using this form follows the dataset definition below).
"properties": {
    "structure": [
        {
            "name": "id",
            "type": "String"
        },
        {
            "name": "deviceType",
            "type": "String"
        },
        {
            "name": "targetResourceType",
            "type": "String"
        },
        {
            "name": "resourceManagmentProcessRunId",
            "type": "String"
        },
        {
            "name": "occurrenceTime",
            "type": "DateTime"
        }
    ],
    "typeProperties": {
        "folderPath": "mycontainer/myfolder",
        "format": {
            "type": "JsonFormat",
            "filePattern": "setOfObjects",
            "jsonPathDefinition": {"id": "$.id", "deviceType": "$.context.device.type", "targetResourceType": "$.context.custom.dimensions[0].TargetResourceType", "resourceManagmentProcessRunId": "$.context.custom.dimensions[1].ResourceManagmentProcessRunId", "occurrenceTime": " $.context.custom.dimensions[2].OccurrenceTime"}      
        }
    }
}
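
If the position of each property inside the dimensions array were not fixed, the same columns could instead be mapped with the array[*].property form described earlier. The following jsonPathDefinition is a variation on the sample above, not part of the original definition:

"jsonPathDefinition": {"id": "$.id", "deviceType": "$.context.device.type", "targetResourceType": "$.context.custom.dimensions[*].TargetResourceType", "resourceManagmentProcessRunId": "$.context.custom.dimensions[*].ResourceManagmentProcessRunId", "occurrenceTime": "$.context.custom.dimensions[*].OccurrenceTime"}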

Sample 2: cross apply multiple objects with the same pattern from array

In this sample, you expect one root JSON object to be transformed into multiple records in the tabular result. If you have a JSON file with the following content:

{
    "ordernumber": "01",
    "orderdate": "20170122",
    "orderlines": [
        {
            "prod": "p1",
            "price": 23
        },
        {
            "prod": "p2",
            "price": 13
        },
        {
            "prod": "p3",
            "price": 231
        }
    ],
    "city": [ { "sanmateo": "No 1" } ]
}

and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array and cross joining it with the common root information:

ordernumber | orderdate | order_pd | order_price | city
01 | 20170122 | P1 | 23 | [{"sanmateo":"No 1"}]
01 | 20170122 | P2 | 13 | [{"sanmateo":"No 1"}]
01 | 20170122 | P3 | 231 | [{"sanmateo":"No 1"}]

The input dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts shown). More specifically:

  • structure section defines the customized column names and the corresponding data type while converting to tabular data. This section is optional unless you need to do column mapping. See Specifying structure definition for rectangular datasets section for more details.
  • jsonNodeReference indicates to iterate and extract data from the objects with the same pattern under array orderlines.
  • jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from. In this example, "ordernumber", "orderdate", and "city" are under the root object, with JSON paths starting with "$.", while "order_pd" and "order_price" are defined with paths derived from the array element, without "$.".
"properties": {
    "structure": [
        {
            "name": "ordernumber",
            "type": "String"
        },
        {
            "name": "orderdate",
            "type": "String"
        },
        {
            "name": "order_pd",
            "type": "String"
        },
        {
            "name": "order_price",
            "type": "Int64"
        },
        {
            "name": "city",
            "type": "String"
        }
    ],
    "typeProperties": {
        "folderPath": "mycontainer/myfolder",
        "format": {
            "type": "JsonFormat",
            "filePattern": "setOfObjects",
            "jsonNodeReference": "$.orderlines",
            "jsonPathDefinition": {"ordernumber": "$.ordernumber", "orderdate": "$.orderdate", "order_pd": "prod", "order_price": "price", "city": " $.city"}         
        }
    }
}

Note the following points:

  • If the structure and jsonPathDefinition are not defined in the Data Factory dataset, the Copy Activity detects the schema from the first object and flattens the whole object.
  • If the JSON input has an array, by default the Copy Activity converts the entire array value into a string. You can choose to extract data from it using jsonNodeReference and/or jsonPathDefinition, or skip it by not specifying it in jsonPathDefinition.
  • If there are duplicate names at the same level, the Copy Activity picks the last one.
  • Property names are case-sensitive. Two properties with same name but different casings are treated as two separate properties.

Case 2: Writing data to JSON file

If you have the following table in SQL Database:

id | order_date | order_price | order_by
1 | 20170119 | 2000 | David
2 | 20170120 | 3500 | Patrick
3 | 20170121 | 4000 | Jason

and, for each record, you want to write a JSON object in the following format:

{
    "id": "1",
    "order": {
        "date": "20170119",
        "price": 2000,
        "customer": "David"
    }
}

The output dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts shown). More specifically, the structure section defines the customized property names in the destination file, and nestingSeparator (default is ".") is used to identify the nesting level from the name. This section is optional unless you want to change the property names compared with the source column names, or nest some of the properties; a variation that changes the nesting separator follows the definition below.

"properties": {
    "structure": [
        {
            "name": "id",
            "type": "String"
        },
        {
            "name": "order.date",
            "type": "String"
        },
        {
            "name": "order.price",
            "type": "Int64"
        },
        {
            "name": "order.customer",
            "type": "String"
        }
    ],
    "typeProperties": {
        "folderPath": "mycontainer/myfolder",
        "format": {
            "type": "JsonFormat"
        }
    }
}
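
As a variation on this sample (not from the original article), you could set nestingSeparator to an underscore and name the structure entries accordingly. With the following sketch, order_date, order_price, and order_customer still nest under an order object in the output JSON:

"properties": {
    "structure": [
        {
            "name": "id",
            "type": "String"
        },
        {
            "name": "order_date",
            "type": "String"
        },
        {
            "name": "order_price",
            "type": "Int64"
        },
        {
            "name": "order_customer",
            "type": "String"
        }
    ],
    "typeProperties": {
        "folderPath": "mycontainer/myfolder",
        "format": {
            "type": "JsonFormat",
            "nestingSeparator": "_"
        }
    }
}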

Specifying AvroFormat

If you want to parse the Avro files or write the data in Avro format, set the format type property to AvroFormat. You do not need to specify any properties in the Format section within the typeProperties section. Example:

"format":
{
    "type": "AvroFormat",
}

To use Avro format in a Hive table, you can refer to Apache Hive’s tutorial.

Note the following points:

Specifying OrcFormat

If you want to parse the ORC files or write the data in ORC format, set the format type property to OrcFormat. You do not need to specify any properties in the Format section within the typeProperties section. Example:

"format":
{
    "type": "OrcFormat"
}

Important

If you are not copying ORC files as-is between on-premises and cloud data stores, you need to install JRE 8 (Java Runtime Environment) on your gateway machine. A 64-bit gateway requires the 64-bit JRE, and a 32-bit gateway requires the 32-bit JRE. You can find both versions here; choose the appropriate one.

Note the following points:

  • Complex data types are not supported (STRUCT, MAP, LIST, UNION)
  • An ORC file has three compression-related options: NONE, ZLIB, and SNAPPY. Data Factory supports reading data from an ORC file in any of these compressed formats. It uses the compression codec in the metadata to read the data. However, when writing to an ORC file, Data Factory chooses ZLIB, which is the default for ORC. Currently, there is no option to override this behavior.

Specifying ParquetFormat

If you want to parse the Parquet files or write the data in Parquet format, set the format type property to ParquetFormat. You do not need to specify any properties in the Format section within the typeProperties section. Example:

"format":
{
    "type": "ParquetFormat"
}

Important

If you are not copying Parquet files as-is between on-premises and cloud data stores, you need to install JRE 8 (Java Runtime Environment) on your gateway machine. A 64-bit gateway requires the 64-bit JRE, and a 32-bit gateway requires the 32-bit JRE. You can find both versions here; choose the appropriate one.

Note the following points:

  • Complex data types are not supported (MAP, LIST)
  • A Parquet file has the following compression-related options: NONE, SNAPPY, GZIP, and LZO. Data Factory supports reading data from a Parquet file in any of these compressed formats. It uses the compression codec in the metadata to read the data. However, when writing to a Parquet file, Data Factory chooses SNAPPY, which is the default for Parquet format. Currently, there is no option to override this behavior.

Where is the copy operation performed?

See Globally available data movement section for details. In short, when an on-premises data store is involved, the copy operation is performed by the Data Management Gateway in your on-premises environment. And, when the data movement is between two cloud stores, the copy operation is performed in the region closest to the sink location in the same geography.

HDInsight Activity - FAQ

What regions are supported by HDInsight?

See the Geographic Availability section of the HDInsight Pricing Details page.

What region is used by an on-demand HDInsight cluster?

The on-demand HDInsight cluster is created in the same region where the storage you specified to be used with the cluster exists.

How to associate additional storage accounts to your HDInsight cluster?

If you are using your own HDInsight Cluster (BYOC - Bring Your Own Cluster), see the following topics:

If you are using an on-demand cluster that is created by the Data Factory service, specify additional storage accounts for the HDInsight linked service so that the Data Factory service can register them on your behalf. In the JSON definition for the on-demand linked service, use additionalLinkedServiceNames property to specify alternate storage accounts as shown in the following JSON snippet:

{
    "name": "MyHDInsightOnDemandLinkedService",
    "properties":
    {
        "type": "HDInsightOnDemandLinkedService",
        "typeProperties": {
            "version": "3.5",
            "clusterSize": 1,
            "timeToLive": "00:05:00",
            "osType": "Linux",
            "linkedServiceName": "LinkedService-SampleData",
            "additionalLinkedServiceNames": [ "otherLinkedServiceName1", "otherLinkedServiceName2" ]
        }
    }
}

In the example above, otherLinkedServiceName1 and otherLinkedServiceName2 represent linked services whose definitions contain credentials that the HDInsight cluster needs to access alternate storage accounts.

Slices - FAQ

Why are my input slices not in Ready state?

A common mistake is not setting the external property to true on the input dataset when the input data is external to the data factory (that is, not produced by the data factory).

In the following example, you only need to set external to true on dataset1.

DataFactory1
Pipeline 1: dataset1 -> activity1 -> dataset2 -> activity2 -> dataset3
Pipeline 2: dataset3 -> activity3 -> dataset4

If you have another data factory with a pipeline that takes dataset4 (produced by pipeline 2 in data factory 1), mark dataset4 as an external dataset because the dataset is produced by a different data factory (DataFactory1, not DataFactory2).

DataFactory2
Pipeline 1: dataset4->activity4->dataset5

If the external property is properly set, verify whether the input data exists in the location specified in the input dataset definition.
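
For reference, a dataset that is marked as external looks similar to the following minimal sketch (the dataset name, store type, linked service name, and path are illustrative assumptions):

{
    "name": "dataset1",
    "properties": {
        "type": "AzureBlob",
        "external": true,
        "linkedServiceName": "MyStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/myinputfolder"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}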

How to run a slice at a time other than midnight when the slice is produced daily?

Use the offset property to specify the time at which you want the slice to be produced. See Dataset availability section for details about this property. Here is a quick example:

"availability":
{
    "frequency": "Day",
    "interval": 1,
    "offset": "06:00:00"
}

Daily slices start at 6 AM instead of the default midnight.

How can I rerun a slice?

You can rerun a slice in one of the following ways:

  • Use Monitor and Manage App to rerun an activity window or slice. See Rerun selected activity windows for instructions.
  • Click Run in the command bar on the DATA SLICE blade for the slice in the Azure portal.
  • Run Set-AzureRmDataFactorySliceStatus cmdlet with Status set to Waiting for the slice.

    Set-AzureRmDataFactorySliceStatus -Status Waiting -ResourceGroupName $ResourceGroup -DataFactoryName $df -TableName $table -StartDateTime "02/26/2015 19:00:00" -EndDateTime "02/26/2015 20:00:00"
    

    See Set-AzureRmDataFactorySliceStatus for details about the cmdlet.

How long did it take to process a slice?

Use Activity Window Explorer in Monitor & Manage App to know how long it took to process a data slice. See Activity Window Explorer for details.

You can also do the following in the Azure portal:

  1. Click Datasets tile on the DATA FACTORY blade for your data factory.
  2. Click the specific dataset on the Datasets blade.
  3. Select the slice that you are interested in from the Recent slices list on the TABLE blade.
  4. Click the activity run from the Activity Runs list on the DATA SLICE blade.
  5. Click Properties tile on the ACTIVITY RUN DETAILS blade.
  6. You should see the DURATION field with a value. This value is the time taken to process the slice.

How to stop a running slice?

If you need to stop the pipeline from executing, you can use the Suspend-AzureRmDataFactoryPipeline cmdlet. Currently, suspending the pipeline does not stop the slice executions that are in progress. After the in-progress executions finish, no extra slice is picked up.

If you really want to stop all the executions immediately, the only way would be to delete the pipeline and create it again. If you choose to delete the pipeline, you do NOT need to delete tables and linked services used by the pipeline.