Datasets and linked services in Azure Data Factory

This article describes what datasets are, how they are defined in JSON format, and how they are used in Azure Data Factory V2 pipelines.

Note

This article applies to version 2 of Data Factory, which is currently in preview. If you are using version 1 of the Data Factory service, which is generally available (GA), see Datasets in Data Factory V1.

If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.

Overview

A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. The activities in a pipeline define actions to perform on your data. For example, you might use a copy activity to copy data from an on-premises SQL Server to Azure Blob storage. Then, you might use a Hive activity that runs a Hive script on an Azure HDInsight cluster to process data from Blob storage to produce output data. Finally, you might use a second copy activity to copy the output data to Azure SQL Data Warehouse, on top of which business intelligence (BI) reporting solutions are built. For more information about pipelines and activities, see Pipelines and activities in Azure Data Factory.

A dataset is a named view of data that simply points to or references the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Blob storage from which the activity should read the data.

Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources. Think of it this way: the dataset represents the structure of the data within the linked data store, and the linked service defines the connection to the data source. For example, an Azure Storage linked service links a storage account to the data factory. An Azure Blob dataset represents the blob container and the folder within that Azure storage account that contains the input blobs to be processed.

Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked services: Azure Storage and Azure SQL Database. Then, you create two datasets: an Azure Blob dataset (which refers to the Azure Storage linked service) and an Azure SQL Table dataset (which refers to the Azure SQL Database linked service). The Azure Storage and Azure SQL Database linked services contain connection strings that Data Factory uses at runtime to connect to your Azure Storage and Azure SQL Database, respectively. The Azure Blob dataset specifies the blob container and blob folder that contains the input blobs in your Blob storage. The Azure SQL Table dataset specifies the SQL table in your SQL database to which the data is to be copied.
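
To make the wiring concrete, here is a minimal sketch of a copy activity that reads from the Blob dataset and writes to the SQL table dataset. The activity and dataset names (CopyFromBlobToSql, AzureBlobInput, AzureSqlOutput) are hypothetical; the datasets themselves would be defined as shown later in this article:

{
    "name": "CopyFromBlobToSql",
    "type": "Copy",
    "inputs": [
        { "referenceName": "AzureBlobInput", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "AzureSqlOutput", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "SqlSink" }
    }
}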

The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data Factory:

(Diagram: relationships among pipeline, activity, dataset, and linked service.)

Linked service JSON

A linked service in Data Factory is defined in JSON format as follows:

{
    "name": "<Name of the linked service>",
    "properties": {
        "type": "<Type of the linked service>",
        "typeProperties": {
              "<data store or compute-specific type properties>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

The following list describes the properties in the above JSON:

  • name (required): Name of the linked service. See Azure Data Factory - Naming rules.
  • type (required): Type of the linked service. For example: AzureStorage (data store) or AzureBatch (compute). See the description of typeProperties.
  • typeProperties (required): The type properties are different for each data store or compute. For the supported data store types and their type properties, see the dataset type table in this article, and navigate to the data store connector article to learn about the type properties specific to a data store. For the supported compute types and their type properties, see Compute linked services.
  • connectVia (optional): The integration runtime to be used to connect to the data store. You can use the Azure Integration Runtime or a self-hosted integration runtime (if your data store is located in a private network). If not specified, the default Azure Integration Runtime is used.

Linked service example

The following linked service is an Azure Storage linked service. Notice that the type is set to AzureStorage. The type properties for the Azure Storage linked service include a connection string. The Data Factory service uses this connection string to connect to the data store at runtime.

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Dataset JSON

A dataset in Data Factory is defined in JSON format as follows:

{
    "name": "<name of dataset>",
    "properties": {
        "type": "<type of dataset: AzureBlob, AzureSql etc...>",
        "linkedServiceName": {
                "referenceName": "<name of linked service>",
                 "type": "LinkedServiceReference",
        },
        "structure": [
            {
                "name": "<Name of the column>",
                "type": "<Name of the type>"
            }
        ],
        "typeProperties": {
            "<type specific property>": "<value>",
            "<type specific property 2>": "<value 2>",
        }
    }
}

The following list describes the properties in the above JSON:

  • name (required): Name of the dataset. See Azure Data Factory - Naming rules.
  • type (required): Type of the dataset. Specify one of the types supported by Data Factory (for example: AzureBlob, AzureSqlTable). For details, see Dataset type.
  • structure (optional): Schema of the dataset. For details, see Dataset structure.
  • typeProperties (required): The type properties are different for each type (for example: Azure Blob, Azure SQL table). For details on the supported types and their properties, see Dataset type.

Dataset example

In the following example, the dataset represents a table named MyTable in a SQL database.

{
    "name": "DatasetSample",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
                "referenceName": "MyAzureSqlLinkedService",
                 "type": "LinkedServiceReference",
        },
        "typeProperties":
        {
            "tableName": "MyTable"
        },
    }
}

Note the following points:

  • type is set to AzureSqlTable.
  • tableName type property (specific to AzureSqlTable type) is set to MyTable.
  • linkedServiceName refers to a linked service of type AzureSqlDatabase, sketched after this list.
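
The referenced linked service is not shown elsewhere in this article, so here is a minimal sketch, assuming a SecureString connection string in the same style as the Azure Storage linked service example earlier; the server, database, and credential values are placeholders:

{
    "name": "MyAzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Encrypt=True;Connection Timeout=30"
            }
        }
    }
}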

Dataset type

There are many different types of datasets, depending on the data store you use. The following list shows the data stores supported by Data Factory, grouped by category. Click a data store to learn how to create a linked service and a dataset for that data store, and to see whether it is supported as a copy source or sink and whether it works with the Azure Integration Runtime, a self-hosted integration runtime, or both.

Azure
  • Azure Blob Storage
  • Azure Cosmos DB
  • Azure Data Lake Store
  • Azure Database for MySQL
  • Azure Database for PostgreSQL
  • Azure File Storage
  • Azure SQL Database
  • Azure SQL Data Warehouse
  • Azure Search Index
  • Azure Table Storage

Database
  • Amazon Redshift
  • DB2
  • Drill (Preview)
  • Google BigQuery
  • Greenplum
  • HBase
  • Hive
  • Apache Impala (Preview)
  • Informix
  • MariaDB
  • Microsoft Access
  • MySQL
  • Netezza
  • Oracle
  • Phoenix
  • PostgreSQL
  • Presto (Preview)
  • SAP Business Warehouse
  • SAP HANA
  • Spark
  • SQL Server
  • Sybase
  • Teradata
  • Vertica

NoSQL
  • Cassandra
  • Couchbase (Preview)
  • MongoDB

File
  • Amazon S3
  • File System
  • FTP
  • HDFS
  • SFTP

Generic protocol
  • Generic HTTP
  • Generic OData
  • Generic ODBC

Services and apps
  • Amazon Marketplace Web Service (Preview)
  • Common Data Service for Apps
  • Concur (Preview)
  • Dynamics 365
  • Dynamics CRM
  • GE Historian
  • HubSpot (Preview)
  • Jira (Preview)
  • Magento (Preview)
  • Marketo (Preview)
  • Oracle Eloqua (Preview)
  • Oracle Responsys (Preview)
  • Paypal (Preview)
  • QuickBooks (Preview)
  • Salesforce
  • Salesforce Service Cloud
  • Salesforce Marketing Cloud (Preview)
  • SAP Cloud for Customer (C4C)
  • SAP ECC
  • ServiceNow
  • Shopify (Preview)
  • Square (Preview)
  • Web Table (HTML table)
  • Xero (Preview)
  • Zoho (Preview)

Note

A connector marked as Preview is one that you can try out and give us feedback on. If you want to take a dependency on a preview connector in your solution, please contact Azure support.

In the example in the previous section, the type of the dataset is set to AzureSqlTable. Similarly, for an Azure Blob dataset, the type of the dataset is set to AzureBlob, as shown in the following JSON:

{
    "name": "AzureBlobInput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
                "referenceName": "MyAzureStorageLinkedService",
                 "type": "LinkedServiceReference",
        }, 

        "typeProperties": {
            "fileName": "input.log",
            "folderPath": "adfgetstarted/inputdata",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        }
    }
}

Dataset structure

The structure section is optional. It defines the schema of the dataset as a collection of column names and data types. You use the structure section to provide type information that is used to convert types and map columns from the source to the destination.

Each column in the structure contains the following properties:

  • name (required): Name of the column.
  • type (optional): Data type of the column. Data Factory supports the following interim data types as allowed values: Int16, Int32, Int64, Single, Double, Decimal, Byte[], Boolean, String, Guid, Datetime, Datetimeoffset, and Timespan.
  • culture (optional): .NET-based culture to be used when the type is a .NET type: Datetime or Datetimeoffset. The default is en-us.
  • format (optional): Format string to be used when the type is a .NET type: Datetime or Datetimeoffset. See Custom Date and Time Format Strings for how to format datetime values.

Example

In the following example, suppose the source Blob data is in CSV format and contains three columns: userid, name, and lastlogindate. They are of type Int64, String, and Datetime, respectively; the last column uses a custom datetime format with abbreviated French names for the day of the week.

Define the Blob dataset structure as follows, along with the type definitions for the columns:

"structure":
[
    { "name": "userid", "type": "Int64"},
    { "name": "name", "type": "String"},
    { "name": "lastlogindate", "type": "Datetime", "culture": "fr-fr", "format": "ddd-MM-YYYY"}
]

Guidance

The following guidelines help you understand when to include structure information, and what to include in the structure section. To learn more about how Data Factory maps source data to the sink and when to specify structure information, see Schema and type mapping.

  • For strong-schema data sources, specify the structure section only if you want to map source columns to sink columns and their names are not the same. This kind of structured data source stores schema and type information along with the data itself. Examples of structured data sources include SQL Server, Oracle, and Azure SQL Database.

    Because type information is already available for structured data sources, you should not include type information when you do include the structure section.
  • For weak-schema or schema-free data sources (for example, a text file in Blob storage), include structure when the dataset is an input for a copy activity and the data types of the source dataset should be converted to the native types of the sink. Also include structure when you want to map source columns to sink columns, as in the sketch after this list.
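
For the column-mapping case, the structure sections are typically paired with a column mapping on the copy activity. The following fragment is a minimal sketch that would sit inside the copy activity's typeProperties; the column names (prodid, prodname, ProductId, ProductName) are hypothetical, and the TabularTranslator columnMappings syntax is described in Schema and type mapping:

"translator": {
    "type": "TabularTranslator",
    "columnMappings": "prodid: ProductId, prodname: ProductName"
}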

Create datasets

You can create datasets by using one of these tools or SDKs: .NET API, PowerShell, REST API, Azure Resource Manager template, or the Azure portal.
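
For example, with an Azure Resource Manager template, a dataset is deployed as a resource of type Microsoft.DataFactory/factories/datasets. The following is a minimal sketch; the apiVersion and all bracketed values are assumptions that you should verify against your environment:

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            "type": "Microsoft.DataFactory/factories/datasets",
            "apiVersion": "2017-09-01-preview",
            "name": "<data factory name>/<dataset name>",
            "properties": {
                "type": "AzureBlob",
                "linkedServiceName": {
                    "referenceName": "<linked service name>",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "folderPath": "<container/folder>"
                }
            }
        }
    ]
}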

V1 vs. V2 datasets

Here are some differences between Data Factory V1 and V2 datasets:

  • The external property is not supported in V2. It's replaced by a trigger (see the sketch after this list).
  • The policy and availability properties are not supported in V2. The start time for a pipeline run depends on triggers.
  • Scoped datasets (datasets defined in a pipeline) are not supported in V2.
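
For illustration, here is a minimal sketch of a schedule trigger that starts a pipeline every hour; the trigger name, pipeline name, and start time are hypothetical:

{
    "name": "MyHourlyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Hour",
                "interval": 1,
                "startTime": "2017-12-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "MyCopyPipeline",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}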

Next steps

See the quickstart tutorials for step-by-step instructions for creating pipelines and datasets by using one of these tools or SDKs.