Datasets in Azure Data Factory

This article describes datasets in Azure Data Factory and includes examples such as offset, anchorDateTime, and offset/style databases.

When you create a dataset, you’re creating a pointer to the data that you want to process. Data is processed (input/output) in an activity and an activity is contained in a pipeline. An input dataset represents the input for an activity in the pipeline and an output dataset represents the output for the activity.

Datasets identify data within different data stores, such as tables, files, folders, and documents. After you create a dataset, you can use it with activities in a pipeline. For example, a dataset can be an input/output dataset of a Copy Activity or an HDInsightHive Activity. The Azure portal gives you a visual layout of all your pipelines and data inputs and outputs. At a glance, you see all the relationships and dependencies of your pipelines across all your sources so you always know where data is coming from and where it is going.

In Azure Data Factory, you can get data from a dataset by using copy activity in a pipeline.

Note

If you are new to Azure Data Factory, see Introduction to Azure Data Factory for an overview of Azure Data Factory service. See Build your first data factory for a tutorial to create your first data factory. These two articles provide you background information you need to understand this article better.

Define datasets

A dataset in Azure Data Factory is defined as follows:

{
    "name": "<name of dataset>",
    "properties": {
        "type": "<type of dataset: AzureBlob, AzureSql etc...>",
        "external": <boolean flag to indicate external data. only for input datasets>,
        "linkedServiceName": "<Name of the linked service that refers to a data store.>",
        "structure": [
            {
                "name": "<Name of the column>",
                "type": "<Name of the type>"
            }
        ],
        "typeProperties": {
            "<type specific property>": "<value>",
            "<type specific property 2>": "<value 2>",
        },
        "availability": {
            "frequency": "<Specifies the time unit for data slice production. Supported frequency: Minute, Hour, Day, Week, Month>",
            "interval": "<Specifies the interval within the defined frequency. For example, frequency set to 'Hour' and interval set to 1 indicates that new data slices should be produced hourly>"
        },
       "policy":
        {      
        }
    }
}

The following table describes properties in the above JSON:

Property Description Required Default
name Name of the dataset. See Azure Data Factory - Naming rules for naming rules. Yes NA
type Type of the dataset. Specify one of the types supported by Azure Data Factory (for example: AzureBlob, AzureSqlTable).

See Dataset Type for details.
Yes NA
structure Schema of the dataset

For details, see Dataset Structure section.
No. NA
typeProperties Properties corresponding to the selected type. See Dataset Type section for details on the supported types and their properties. Yes NA
external Boolean flag to specify whether a dataset is explicitly produced by a data factory pipeline or not. No false
availability Defines the processing window or the slicing model for the dataset production.

For details, see Dataset Availability section.

For details on the dataset slicing model, see Scheduling and Execution article.
Yes NA
policy Defines the criteria or the condition that the dataset slices must fulfill.

For details, see Dataset Policy section.
No NA

Dataset example

In the following example, the dataset represents a table named MyTable in an Azure SQL database.

{
    "name": "DatasetSample",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "typeProperties":
        {
            "tableName": "MyTable"
        },
        "availability":
        {
            "frequency": "Day",
            "interval": 1
        }
    }
}

Note the following points:

  • type is set to AzureSqlTable.
  • tableName type property (specific to AzureSqlTable type) is set to MyTable.
  • linkedServiceName refers to a linked service of type AzureSqlDatabase, which is defined in the following JSON snippet.
  • availability frequency is set to Day and interval is set to 1, which means that the slice is produced daily.

AzureSqlLinkedService is defined as follows:

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "description": "",
        "typeProperties": {
            "connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>@<servername>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30"
        }
    }
}

In the above JSON:

  • type is set to AzureSqlDatabase
  • connectionString type property specifies information to connect to an Azure SQL database.

As you can see, the linked service defines how to connect to an Azure SQL database. The dataset defines what table is used as an input/output for the activity in a pipeline. The activity section in your pipeline JSON specifies whether the dataset is used as an input or output dataset.

Important

Unless a dataset is being produced by Azure Data Factory, it should be marked as external. This setting generally applies to inputs of first activity in a pipeline.

Dataset Type

The supported data sources and dataset types are aligned. See topics referenced in the Data Movement Activities article for information on types and configuration of datasets. For example, if you are using data from an Azure SQL database, click Azure SQL Database in the list of supported data stores to see detailed information.

Dataset Structure

The structure section is an optional section that defines the schema of the dataset. It contains a collection of names and data types of columns. You use the structure section for either providing type information for type conversions or doing column mappings. In the following example, the dataset has three columns slicetimestamp, projectname, and pageviews and they are of type: String, String, and Decimal respectively.

structure:  
[
    { "name": "slicetimestamp", "type": "String"},
    { "name": "projectname", "type": "String"},
    { "name": "pageviews", "type": "Decimal"}
]

Each column contains the following properties:

Property Description Required
name Name of the column. Yes
type Data type of the column. No
culture .NET based culture to be used when type is specified and is .NET type Datetime or Datetimeoffset. Default is “en-us”. No
format Format string to be used when type is specified and is .NET type Datetime or Datetimeoffset. No

Use the following guidelines for when to include “structure” information and what to include in the structure section.

  • For structured data sources that store data schema and type information along with the data itself (sources like SQL Server, Oracle, Azure table etc.), you should specify the “structure” section only if you want map source columns to sink columns and their names are not the same.

    As type information is already available for structured data sources, you should not include type information when you do include the “structure” section.

  • For schema on read data sources (specifically Azure blob), you can choose to store data without storing any schema or type information with the data. For these types of data sources, include “structure” when you want to map source columns to sink columns (or) when the dataset is an input dataset for a copy activity and data types of source dataset need to be converted to native types for the sink.

    Data factory supports the following CLS-compliant .NET based type values for providing type information in “structure” for schema on read data sources like Azure blob: Int16, Int32, Int64, Single, Double, Decimal, Byte[], Bool, String, Guid, Datetime, Datetimeoffset, Timespan.

Data Factory automatically performs type conversions when moving data from a source data store to a sink data store.

Dataset Availability

The availability section in a dataset defines the processing window (hourly, daily, weekly etc.) or the slicing model for the dataset. See Scheduling and Execution article for more details on the dataset slicing and dependency model.

The following availability section specifies that the output dataset is either produced hourly (or) input dataset is available hourly:

"availability":    
{    
    "frequency": "Hour",        
    "interval": 1    
}

The following table describes properties you can use in the availability section:

Property Description Required Default
frequency Specifies the time unit for dataset slice production.

Supported frequency: Minute, Hour, Day, Week, Month
Yes NA
interval Specifies a multiplier for frequency

”Frequency x interval” determines how often the slice is produced.

If you need the dataset to be sliced on an hourly basis, you set Frequencyto Hour, and interval to 1.

Note: If you specify Frequency as Minute, we recommend that you set the interval to no less than 15
Yes NA
style Specifies whether the slice should be produced at the start/end of the interval.
  • StartOfInterval
  • EndOfInterval


If Frequency is set to Month and style is set to EndOfInterval, the slice is produced on the last day of month. If the style is set to StartOfInterval, the slice is produced on the first day of month.

If Frequency is set to Day and style is set to EndOfInterval, the slice is produced in the last hour of the day.

If Frequency is set to Hour and style is set to EndOfInterval, the slice is produced at the end of the hour. For example, for a slice for 1 PM – 2 PM period, the slice is produced at 2 PM.
No EndOfInterval
anchorDateTime Defines the absolute position in time used by scheduler to compute dataset slice boundaries.

Note: If the AnchorDateTime has date parts that are more granular than the frequency then the more granular parts are ignored.

For example, if the interval is hourly (frequency: hour and interval: 1) and the AnchorDateTime contains minutes and secondsthen the minutes and seconds parts of the AnchorDateTime are ignored.
No 01/01/0001
offset Timespan by which the start and end of all dataset slices are shifted.

Note: If both anchorDateTime and offset are specified, the result is the combined shift.
No NA

offset example

Daily slices that start at 6 AM instead of the default midnight.

"availability":
{
    "frequency": "Day",
    "interval": 1,
    "offset": "06:00:00"
}

The frequency is set to Day and interval is set to 1 (once a day): If you want the slice to be produced at 6 AM instead of at the default time: 12 AM. Remember that this time is an UTC time.

anchorDateTime example

Example: 23 hours dataset slices that start on 2007-04-19T08:00:00

"availability":    
{    
    "frequency": "Hour",        
    "interval": 23,    
    "anchorDateTime":"2007-04-19T08:00:00"    
}

offset/style Example

If you need dataset on monthly basis on specific date and time (suppose on 3rd of every month at 8:00 AM), use the offset tag to set the date and time it should run.

{
  "name": "MyDataset",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "MyTable"
    },
    "availability": {
      "frequency": "Month",
      "interval": 1,
      "offset": "3.08:00:00",
      "style": "StartOfInterval"
    }
  }
}

Dataset Policy

The policy section in dataset definition defines the criteria or the condition that the dataset slices must fulfill.

Validation policies

Policy Name Description Applied To Required Default
minimumSizeMB Validates that the data in an Azure blob meets the minimum size requirements (in megabytes). Azure Blob No NA
minimumRows Validates that the data in an Azure SQL database or an Azure table contains the minimum number of rows.
  • Azure SQL Database
  • Azure Table
No NA

Examples

minimumSizeMB:

"policy":

{
    "validation":
    {
        "minimumSizeMB": 10.0
    }
}

minimumRows

"policy":
{
    "validation":
    {
        "minimumRows": 100
    }
}

External datasets

External datasets are the ones that are not produced by a running pipeline in the data factory. If the dataset is marked as external, the ExternalData policy may be defined to influence the behavior of the dataset slice availability.

Unless a dataset is being produced by Azure Data Factory, it should be marked as external. This setting generally applies to the inputs of first activity in a pipeline unless activity or pipeline chaining is being used.

Name Description Required Default Value
dataDelay Time to delay the check on the availability of the external data for the given slice. For example, if the data is supposed to be available hourly, the check to see the external data is available and the corresponding slice is Ready can be delayed by using dataDelay.

Only applies to the present time. For example, if it is 1:00 PM right now and this value is 10 minutes, the validation starts at 1:10 PM.

This setting does not affect slices in the past (slices with Slice End Time + dataDelay < Now) are processed without any delay.

Time greater than 23:59 hours need to specified using the day.hours:minutes:seconds format. For example, to specify 24 hours, don't use 24:00:00; instead, use 1.00:00:00. If you use 24:00:00, it is treated as 24 days (24.00:00:00). For 1 day and 4 hours, specify 1:04:00:00.
No 0
retryInterval The wait time between a failure and the next retry attempt. Applies to present time; if the previous try failed, we wait this long after the last try.

If it is 1:00pm right now, we begin the first try. If the duration to complete the first validation check is 1 minute and the operation failed, the next retry is at 1:00 + 1min (duration) + 1min (retry interval) = 1:02pm.

For slices in the past, there is no delay. The retry happens immediately.
No 00:01:00 (1 minute)
retryTimeout The timeout for each retry attempt.

If this property is set to 10 minutes, the validation needs to be completed within 10 minutes. If it takes longer than 10 minutes to perform the validation, the retry times out.

If all attempts for the validation times out, the slice is marked as TimedOut.
No 00:10:00 (10 minutes)
maximumRetry Number of times to check for the availability of the external data. The allowed maximum value is 10. No 3

Scoped datasets

You can create datasets that are scoped to a pipeline by using the datasets property. These datasets can only be used by activities within this pipeline but not by activities in other pipelines. The following example defines a pipeline with two datasets - InputDataset-rdc and OutputDataset-rdc - to be used within the pipeline:

Important

Scoped datasets are supported only with one-time pipelines (pipelineMode set to OneTime). See Onetime pipeline for details.

{
    "name": "CopyPipeline-rdc",
    "properties": {
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource",
                        "recursive": false
                    },
                    "sink": {
                        "type": "BlobSink",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataset-rdc"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataset-rdc"
                    }
                ],
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1,
                    "style": "StartOfInterval"
                },
                "name": "CopyActivity-0"
            }
        ],
        "start": "2016-02-28T00:00:00Z",
        "end": "2016-02-28T00:00:00Z",
        "isPaused": false,
        "pipelineMode": "OneTime",
        "expirationTime": "15.00:00:00",
        "datasets": [
            {
                "name": "InputDataset-rdc",
                "properties": {
                    "type": "AzureBlob",
                    "linkedServiceName": "InputLinkedService-rdc",
                    "typeProperties": {
                        "fileName": "emp.txt",
                        "folderPath": "adftutorial/input",
                        "format": {
                            "type": "TextFormat",
                            "rowDelimiter": "\n",
                            "columnDelimiter": ","
                        }
                    },
                    "availability": {
                        "frequency": "Day",
                        "interval": 1
                    },
                    "external": true,
                    "policy": {}
                }
            },
            {
                "name": "OutputDataset-rdc",
                "properties": {
                    "type": "AzureBlob",
                    "linkedServiceName": "OutputLinkedService-rdc",
                    "typeProperties": {
                        "fileName": "emp.txt",
                        "folderPath": "adftutorial/output",
                        "format": {
                            "type": "TextFormat",
                            "rowDelimiter": "\n",
                            "columnDelimiter": ","
                        }
                    },
                    "availability": {
                        "frequency": "Day",
                        "interval": 1
                    },
                    "external": false,
                    "policy": {}
                }
            }
        ]
    }
}