Copy data to or from Azure Blob Storage using Azure Data Factory

This article explains how to use the Copy Activity in Azure Data Factory to copy data to/from Azure Blob Storage. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.

You can copy data from any supported source data store to Azure Blob Storage or from Azure Blob Storage to any supported sink data store. For a list of data stores supported as sources or sinks by the copy activity, see the Supported data stores table.

Important

Copy Activity supports copying data from/to both general-purpose Azure Storage accounts and Hot/Cool Blob storage. The activity supports reading from block, append, or page blobs, but supports writing to only block blobs. Azure Premium Storage is not supported as a sink because it is backed by page blobs.

Copy Activity does not delete data from the source after the data is successfully copied to the destination. If you need to delete source data after a successful copy, create a custom activity to delete the data and use the activity in the pipeline.

Getting started

You can create a pipeline with a copy activity that moves data to/from Azure Blob Storage by using different tools/APIs.

The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.

You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.

Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:

  1. Create linked services to link input and output data stores to your data factory.
  2. Create datasets to represent input and output data for the copy operation.
  3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from Azure Blob Storage, see the JSON examples section of this article.

The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure Blob Storage:

Linked service properties

There are two types of linked services you can use to link an Azure Storage account to an Azure data factory: the AzureStorage linked service and the AzureStorageSas linked service. The Azure Storage linked service provides the data factory with global access to the Azure Storage account, whereas the Azure Storage SAS (Shared Access Signature) linked service provides the data factory with restricted/time-bound access. There are no other differences between these two linked services. Choose the linked service that suits your needs. The following sections provide more details on these two linked services.

Azure Storage Linked Service

The Azure Storage linked service allows you to link an Azure storage account to an Azure data factory by using the account key. It provides the data factory with global access to the Azure Storage account. The following table describes the JSON elements specific to the Azure Storage linked service.

Property Description Required
type The type property must be set to: AzureStorage Yes
connectionString Specify information needed to connect to Azure storage for the connectionString property. Yes

See the following article for steps to view/copy the account key for an Azure storage account: View, copy, and regenerate storage access keys.

Example:

{  
    "name": "StorageLinkedService",  
    "properties": {  
        "type": "AzureStorage",  
        "typeProperties": {  
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"  
        }  
    }  
}  

Azure Storage SAS Linked Service

A shared access signature (SAS) provides delegated access to resources in your storage account. This means that you can grant a client limited permissions to objects in your storage account for a specified period of time and with a specified set of permissions, without having to share your account access keys. The SAS is a URI that encompasses in its query parameters all of the information necessary for authenticated access to a storage resource. To access storage resources with the SAS, the client only needs to pass in the SAS to the appropriate constructor or method. For detailed information about SAS, see Shared Access Signatures: Understanding the SAS Model.

The Azure Storage SAS linked service allows you to link an Azure storage account to an Azure data factory by using a Shared Access Signature (SAS). It provides the data factory with restricted/time-bound access to all or specific resources (blob/container) in the storage account. The following table describes the JSON elements specific to the Azure Storage SAS linked service.

Property Description Required
type The type property must be set to: AzureStorageSas Yes
sasUri Specify Shared Access Signature URI to the Azure Storage resources such as blob, container, or table. See the notes below for details. Yes

Example:

{  
    "name": "StorageSasLinkedService",  
    "properties": {  
        "type": "AzureStorageSas",  
        "typeProperties": {  
            "sasUri": "<storageUri>?<sasToken>"   
        }  
    }  
}  

When creating a SAS URI, consider the following:

  • Azure Data Factory supports only Service SAS, not Account SAS. See Types of Shared Access Signatures for details about these two types.
  • Appropriate read/write permissions need to be set on objects based on how the linked service (read, write, read/write) will be used in your data factory.
  • Expiry time needs to be set appropriately. Make sure that the access to Azure Storage objects does not expire within the active period of the pipeline.
  • Create the URI at the right container/blob or table level based on your needs. A SAS URI to an Azure blob allows the Data Factory service to access that particular blob. A SAS URI to an Azure blob container allows the Data Factory service to iterate through blobs in that container. If you need to provide access to more or fewer objects later, or to update the SAS URI, remember to update the linked service with the new URI (see the sketch after this list).
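
For illustration, the following sketch shows an Azure Storage SAS linked service whose sasUri is scoped to a single blob container; the account name, container name, and SAS token are placeholders rather than values from this article:

{
    "name": "StorageSasLinkedService",
    "properties": {
        "type": "AzureStorageSas",
        "typeProperties": {
            "sasUri": "https://<accountname>.blob.core.windows.net/<containername>?<sasToken>"
        }
    }
}

Because this URI is created at the container level, the Data Factory service can iterate through the blobs in that container but cannot reach other containers in the storage account.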

Dataset properties

To specify a dataset to represent input or output data in Azure Blob Storage, set the type property of the dataset to AzureBlob. Set the linkedServiceName property of the dataset to the name of the Azure Storage or Azure Storage SAS linked service. The type properties of the dataset specify the blob container and the folder in blob storage.

For a full list of JSON sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).

Data factory supports the following CLS-compliant .NET based type values for providing type information in "structure" for schema-on-read data sources like Azure blob: Int16, Int32, Int64, Single, Double, Decimal, Byte[], Bool, String, Guid, Datetime, Datetimeoffset, Timespan. Data Factory automatically performs type conversions when moving data from a source data store to a sink data store.
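
For example, a minimal structure section for a blob dataset could look like the following sketch; the column names used here are hypothetical:

"structure":
[
    { "name": "userid", "type": "Int64" },
    { "name": "name", "type": "String" },
    { "name": "lastlogindate", "type": "Datetime" }
],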

The typeProperties section is different for each type of dataset and provides information about the location, format, etc., of the data in the data store. The typeProperties section for a dataset of type AzureBlob has the following properties:

Property Description Required
folderPath Path to the container and folder in the blob storage. Example: myblobcontainer\myblobfolder\ Yes
fileName Name of the blob. fileName is optional and case-sensitive.

If you specify a fileName, the activity (including Copy) works on that specific blob.

When fileName is not specified, Copy includes all blobs in the folderPath for an input dataset.

When fileName is not specified for an output dataset, the name of the generated file is in the following format: Data.<Guid>.txt (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt). No
partitionedBy partitionedBy is an optional property. You can use it to specify a dynamic folderPath and fileName for time series data. For example, folderPath can be parameterized for every hour of data. See the Using partitionedBy property section for details and examples. No
format The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. No
compression Specify the type and level of compression for the data. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. No
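
As an illustration only, the following sketch combines several of these properties in one AzureBlob dataset that reads GZip-compressed text files; the dataset name, container, folder, and file name are placeholders (the StorageLinkedService name matches the linked service example earlier in this article):

{
    "name": "AzureBlobCompressedDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/myfolder/",
            "fileName": "myfile.csv.gz",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}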

Using partitionedBy property

As mentioned in the previous section, you can specify a dynamic folderPath and filename for time series data with the partitionedBy property, Data Factory functions, and the system variables.

For more information on time series datasets, scheduling, and slices, see Creating Datasets and Scheduling & Execution articles.

Sample 1

"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
    { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],

In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the specified format (yyyyMMddHH). SliceStart refers to the start time of the slice. The folderPath is different for each slice, for example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.

Sample 2

"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
 [
    { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
    { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
    { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
    { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],

In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by folderPath and fileName properties.

Copy activity properties

For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output datasets, and policies are available for all types of activities, whereas properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks. If you are moving data from Azure Blob Storage, you set the source type in the copy activity to BlobSource. Similarly, if you are moving data to Azure Blob Storage, you set the sink type in the copy activity to BlobSink. This section provides a list of properties supported by BlobSource and BlobSink.

BlobSource supports the following properties in the typeProperties section:

Property Description Allowed values Required
recursive Indicates whether the data is read recursively from the subfolders or only from the specified folder. True (default value), False No
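
For example, a hypothetical copy activity source that reads only the files directly under the folder specified by folderPath, and skips any subfolders, could be sketched as:

"source": {
    "type": "BlobSource",
    "recursive": false
}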

BlobSink supports the following properties in the typeProperties section:

Property Description Allowed values Required
copyBehavior Defines the copy behavior when the source is BlobSource or FileSystem. PreserveHierarchy, FlattenHierarchy, or MergeFiles (see below). No

  • PreserveHierarchy: preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
  • FlattenHierarchy: all files from the source folder are placed in the first level of the target folder. The target files have auto-generated names.
  • MergeFiles: merges all files from the source folder into one file. If the file/blob name is specified, the merged file name is the specified name; otherwise, the file name is auto-generated.
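
Similarly, a hypothetical sink that merges all source files into a single target blob with an auto-generated name could be sketched as:

"sink": {
    "type": "BlobSink",
    "copyBehavior": "MergeFiles"
}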

BlobSource also supports these two properties for backward compatibility.

  • treatEmptyAsNull: Specifies whether to treat a null or empty string as a null value.
  • skipHeaderLineCount: Specifies how many lines need to be skipped. It is applicable only when the input dataset uses TextFormat.

Similarly, BlobSink supports the following property for backward compatibility.

  • blobWriterAddHeader: Specifies whether to add a header of column definitions while writing to an output dataset.

Datasets now support the following properties that implement the same functionality: treatEmptyAsNull, skipLineCount, firstRowAsHeader.

The following table provides guidance on using the new dataset properties in place of these blob source/sink properties.

Copy Activity property Dataset property
skipHeaderLineCount on BlobSource skipLineCount and firstRowAsHeader. Lines are skipped first and then the first row is read as a header.
treatEmptyAsNull on BlobSource treatEmptyAsNull on input dataset
blobWriterAddHeader on BlobSink firstRowAsHeader on output dataset

See the Specifying TextFormat section for detailed information on these properties.
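
As a sketch, these dataset properties go under the format section of a dataset that uses TextFormat; the values shown here (skip one line, read the next line as a header, and treat empty strings as null) are illustrative only:

"format": {
    "type": "TextFormat",
    "columnDelimiter": ",",
    "skipLineCount": 1,
    "firstRowAsHeader": true,
    "treatEmptyAsNull": true
}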

recursive and copyBehavior examples

This section describes the resulting behavior of the Copy operation for different combinations of recursive and copyBehavior values.

recursive copyBehavior Resulting behavior
true preserveHierarchy For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target folder Folder1 is created with the same structure as the source:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5
true flattenHierarchy For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target Folder1 is created with the following structure:

Folder1
    auto-generated name for File1
    auto-generated name for File2
    auto-generated name for File3
    auto-generated name for File4
    auto-generated name for File5
true mergeFiles For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target Folder1 is created with the following structure:

Folder1
    File1 + File2 + File3 + File4 + File5 contents are merged into one file with an auto-generated file name
false preserveHierarchy For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target folder Folder1 is created with the following structure:

Folder1
    File1
    File2


Subfolder1 with File3, File4, and File5 is not picked up.
false flattenHierarchy For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target folder Folder1 is created with the following structure:

Folder1
    auto-generated name for File1
    auto-generated name for File2


Subfolder1 with File3, File4, and File5 is not picked up.
false mergeFiles For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target folder Folder1 is created with the following structure:

Folder1
    File1 + File2 contents are merged into one file with an auto-generated file name.

Subfolder1 with File3, File4, and File5 is not picked up.

Supported file and compression formats

See the File and compression formats in Azure Data Factory article for details.

JSON examples

The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data to and from Azure Blob Storage and Azure SQL Database. However, data can be copied directly from any of the supported sources to any of the supported sinks by using the Copy Activity in Azure Data Factory.

Example: Copy data from Blob Storage to SQL Database

The following sample shows:

  1. A linked service of type AzureSqlDatabase.
  2. A linked service of type AzureStorage.
  3. An input dataset of type AzureBlob.
  4. An output dataset of type AzureSqlTable.
  5. A pipeline with a Copy activity that uses BlobSource and SqlSink.

The sample copies time-series data from an Azure blob to an Azure SQL table hourly. The JSON properties used in these samples are described in sections following the samples.

Azure SQL linked service:

{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
    }
  }
}

Azure Storage linked service:

{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}

Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the former, you specify the connection string that includes the account key; for the latter, you specify the Shared Access Signature (SAS) URI. See the Linked service properties section for details.

Azure Blob input dataset:

Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, and day parts of the start time, and the file name uses the hour part of the start time. The "external": true setting informs Data Factory that the table is external to the data factory and is not produced by an activity in the data factory.

{
  "name": "AzureBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/",
      "fileName": "{Hour}.csv",
      "partitionedBy": [
        {
          "name": "Year",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "yyyy"
          }
        },
        {
          "name": "Month",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "MM"
          }
        },
        {
          "name": "Day",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "dd"
          }
        },
        {
          "name": "Hour",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "HH"
          }
        }
      ],
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ",",
        "rowDelimiter": "\n"
      }
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "policy": {
      "externalData": {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
      }
    }
  }
}

Azure SQL output dataset:

The sample copies data to a table named "MyOutputTable" in an Azure SQL database. Create the table in your Azure SQL database with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour.

{
  "name": "AzureSqlOutput",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "MyOutputTable"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

A copy activity in a pipeline with Blob source and SQL sink:

The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and sink type is set to SqlSink.

{  
    "name":"SamplePipeline",
    "properties":{  
    "start":"2014-06-01T18:00:00",
    "end":"2014-06-01T19:00:00",
    "description":"pipeline with copy activity",
    "activities":[  
      {
        "name": "AzureBlobtoSQL",
        "description": "Copy Activity",
        "type": "Copy",
        "inputs": [
          {
            "name": "AzureBlobInput"
          }
        ],
        "outputs": [
          {
            "name": "AzureSqlOutput"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "BlobSource"
          },
          "sink": {
            "type": "SqlSink"
          }
        },
       "scheduler": {
          "frequency": "Hour",
          "interval": 1
        },
        "policy": {
          "concurrency": 1,
          "executionPriorityOrder": "OldestFirst",
          "retry": 0,
          "timeout": "01:00:00"
        }
      }
      ]
   }
}

Example: Copy data from Azure SQL to Azure Blob

The following sample shows:

  1. A linked service of type AzureSqlDatabase.
  2. A linked service of type AzureStorage.
  3. An input dataset of type AzureSqlTable.
  4. An output dataset of type AzureBlob.
  5. A pipeline with Copy activity that uses SqlSource and BlobSink.

The sample copies time-series data from an Azure SQL table to an Azure blob hourly. The JSON properties used in these samples are described in sections following the samples.

Azure SQL linked service:

{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
    }
  }
}

Azure Storage linked service:

{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}

Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the former, you specify the connection string that includes the account key; for the latter, you specify the Shared Access Signature (SAS) URI. See the Linked service properties section for details.

Azure SQL input dataset:

The sample assumes you have created a table "MyTable" in Azure SQL and it contains a column called "timestampcolumn" for time series data.

Setting "external": true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory.

{
  "name": "AzureSqlInput",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "MyTable"
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "policy": {
      "externalData": {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
      }
    }
  }
}

Azure Blob output dataset:

Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time.

{
  "name": "AzureBlobOutput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/",
      "partitionedBy": [
        {
          "name": "Year",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "yyyy"
          }
        },
        {
          "name": "Month",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "MM"
          }
        },
        {
          "name": "Day",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "dd"
          }
        },
        {
          "name": "Hour",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "HH"
          }
        }
      ],
      "format": {
        "type": "TextFormat",
        "columnDelimiter": "\t",
        "rowDelimiter": "\n"
      }
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

A copy activity in a pipeline with SQL source and Blob sink:

The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to SqlSource and sink type is set to BlobSink. The SQL query specified for the SqlReaderQuery property selects the data in the past hour to copy.

{  
    "name":"SamplePipeline",
    "properties":{  
        "start":"2014-06-01T18:00:00",
        "end":"2014-06-01T19:00:00",
        "description":"pipeline for copy activity",
        "activities":[  
              {
                "name": "AzureSQLtoBlob",
                "description": "copy activity",
                "type": "Copy",
                "inputs": [
                  {
                    "name": "AzureSQLInput"
                  }
                ],
                "outputs": [
                  {
                    "name": "AzureBlobOutput"
                  }
                ],
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
                      },
                      "sink": {
                        "type": "BlobSink"
                      }
                },
                   "scheduler": {
                      "frequency": "Hour",
                      "interval": 1
                },
                "policy": {
                      "concurrency": 1,
                      "executionPriorityOrder": "OldestFirst",
                      "retry": 0,
                      "timeout": "01:00:00"
                }
              }
         ]
    }
}
Note

To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data Factory.
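
As a brief illustrative sketch (the column names here are hypothetical), column mapping is typically expressed with a translator in the copy activity's typeProperties, alongside the source and sink:

"translator": {
    "type": "TabularTranslator",
    "columnMappings": "UserId: MyUserId, Name: MyName"
}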

Performance and Tuning

See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.