Incrementally copy new and changed files based on LastModifiedDate by using the Copy Data tool
In this tutorial, you'll use the Azure portal to create a data factory. You'll then use the Copy Data tool to create a pipeline that incrementally copies only new and changed files, based on their LastModifiedDate, from one Azure Blob storage container to another.
With this approach, ADF scans all the files in the source store, filters them by their LastModifiedDate, and copies only the files that are new or have been updated since the last run to the destination store. Note that if ADF scans a huge number of files but copies only a few of them to the destination, the run can still take a long time, because scanning the files is itself time consuming.
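To make that filtering behavior concrete, here is a minimal sketch of the same idea in Python using the azure-storage-blob SDK. The Copy Data tool configures all of this for you inside ADF; this is only an illustration, and the connection string is a placeholder. The 15-minute window mirrors the trigger recurrence you'll configure later in this tutorial.

```python
# Minimal sketch of the LastModifiedDate filter that the Copy Data tool
# configures for you; not ADF's implementation, just an illustration.
# Assumes the azure-storage-blob package; the connection string is a placeholder.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "<your-storage-connection-string>"  # placeholder
WINDOW = timedelta(minutes=15)  # mirrors the tumbling-window recurrence

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
source = service.get_container_client("source")
destination = service.get_container_client("destination")

cutoff = datetime.now(timezone.utc) - WINDOW
for blob in source.list_blobs():
    # Every blob is scanned, but only those modified inside the window are
    # copied -- which is why scanning many files still takes time.
    if blob.last_modified >= cutoff:
        src_url = source.get_blob_client(blob.name).url
        # Within the same account this works when the caller is authorized;
        # otherwise the source URL would need a SAS token appended.
        destination.get_blob_client(blob.name).start_copy_from_url(src_url)
```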
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
In this tutorial, you will perform the following tasks:
- Create a data factory.
- Use the Copy Data tool to create a pipeline.
- Monitor the pipeline and activity runs.
Prerequisites
- Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
- Azure storage account: Use Blob storage as the source and sink data store. If you don't have an Azure storage account, see the instructions in Create a storage account.
Create two containers in Blob storage
Prepare your Blob storage for the tutorial by performing these steps.
Create a container named source. You can use various tools to perform this task, such as Azure Storage Explorer.
Create a container named destination.
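If you prefer to script this step, the following sketch creates both containers with the azure-storage-blob Python package; Azure Storage Explorer or the portal work just as well, and the connection string is a placeholder.

```python
# One way to create the two containers programmatically.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-storage-connection-string>")
for name in ("source", "destination"):
    service.create_container(name)  # raises ResourceExistsError if it already exists
```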
Create a data factory
On the left menu, select Create a resource > Data + Analytics > Data Factory.
On the New data factory page, under Name, enter ADFTutorialDataFactory.
The name of your data factory must be globally unique. If you receive an error message saying the name is already in use, enter a different name for the data factory. For example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory artifacts, see Data Factory naming rules.
Select the Azure subscription in which you'll create the new data factory.
For Resource Group, take one of the following steps:
Select Use existing and select an existing resource group from the drop-down list.
Select Create new and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
Under Version, select V2.
Under Location, select the location for the data factory. Only supported locations are displayed in the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example, Azure HDInsight) that your data factory uses can be in other regions.
Select Pin to dashboard.
On the dashboard, you can see the Deploying Data Factory tile, which shows the deployment status.
After creation is finished, the Data Factory home page is displayed.
To open the Azure Data Factory user interface (UI) on a separate tab, select the Author & Monitor tile.
Use the Copy Data tool to create a pipeline
On the Let's get started page, select the Copy Data tile to open the Copy Data tool.
On the Properties page, take the following steps:
a. Under Task name, enter DeltaCopyFromBlobPipeline.
b. Under Task cadence or Task schedule, select Run regularly on schedule.
c. Under Trigger Type, select Tumbling Window.
d. Under Recurrence, enter 15 Minute(s).
e. Select Next.
The Data Factory UI creates a pipeline with the specified task name.
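For intuition about the trigger you just configured, the sketch below illustrates in plain Python (not ADF configuration) how a tumbling-window trigger partitions time: each run gets a contiguous, non-overlapping window, and the copy activity picks up only files whose LastModifiedDate falls inside that run's window. The start time is a placeholder.

```python
# Illustration only: how a tumbling-window trigger slices time into
# non-overlapping 15-minute windows, one pipeline run per window.
from datetime import datetime, timedelta, timezone

start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # placeholder trigger start
recurrence = timedelta(minutes=15)

for i in range(4):  # first four windows, for illustration
    window_start = start + i * recurrence
    window_end = window_start + recurrence
    print(f"run {i}: copy blobs where {window_start} <= LastModifiedDate < {window_end}")
```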
On the Source data store page, complete the following steps:
a. Select + Create new connection to add a connection.
b. Select Azure Blob Storage from the gallery, and then select Continue.
c. On the New Linked Service page, select your storage account from the Storage account name list and then select Finish.
d. Select the newly created linked service and then select Next.
On the Choose the input file or folder page, complete the following steps:
a. Browse and select the source folder, and then select Choose.
b. Under File loading behavior, select Incremental load: LastModifiedDate.
c. Check Binary copy and select Next.
On the Destination data store page, select AzureBlobStorage. This is the same storage account as the source data store. Then select Next.
On the Choose the output file or folder page, complete the following steps:
a. Browse and select the destination folder, and then select Choose.
b. Select Next.
On the Settings page, select Next.
On the Summary page, review the settings and then select Next.
On the Deployment page, select Monitor to monitor the pipeline (task).
Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view activity run details and to rerun the pipeline. Select Refresh to refresh the list, and select the View Activity Runs link in the Actions column.
There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the copy operation, select the Details link (eyeglasses icon) in the Actions column.
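The portal's Monitor tab is all this tutorial needs, but the same run history can also be queried programmatically. The sketch below uses the azure-mgmt-datafactory and azure-identity packages; the subscription ID and resource group are placeholders.

```python
# Optional: query the last hour of pipeline runs programmatically.
# Subscription ID and resource group are placeholders.
from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
now = datetime.now(timezone.utc)
runs = client.pipeline_runs.query_by_factory(
    "<resource-group>",
    "ADFTutorialDataFactory",
    RunFilterParameters(last_updated_after=now - timedelta(hours=1),
                        last_updated_before=now),
)
for run in runs.value:
    print(run.pipeline_name, run.status, run.run_start)
```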
Because there are no files in the source container of your Blob storage account yet, you won't see any files copied to the destination container.
Create an empty text file and name it file1.txt. Upload this text file to the source container in your storage account. You can use various tools to perform these tasks, such as Azure Storage Explorer.
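If you'd rather script the upload, this sketch creates the empty blob directly; the connection string is a placeholder.

```python
# Upload an empty file1.txt to the source container; Azure Storage
# Explorer works just as well.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-storage-connection-string>")
service.get_container_client("source").upload_blob("file1.txt", b"")
```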
To go back to the Pipeline Runs view, select All Pipeline Runs, and wait for the same pipeline to be triggered again automatically.
Select View Activity Run for the second pipeline run when you see it. Then review the details in the same way you did for the first pipeline run.
You will see that one file (file1.txt) has been copied from the source container to the destination container of your Blob storage account.
Create another empty text file and name it file2.txt. Upload this text file to the source container in your Blob storage account.
Repeat the previous two steps for this second text file. You will see that only the new file (file2.txt) has been copied from the source container to the destination container of your storage account in the next pipeline run.
You can also verify this by using Azure Storage Explorer to scan the files.
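For example, a quick listing with the Python SDK shows what landed in the destination container; the connection string is a placeholder.

```python
# List the destination container to confirm which files were copied and when.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-storage-connection-string>")
for blob in service.get_container_client("destination").list_blobs():
    print(blob.name, blob.last_modified)
```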
Advance to the following tutorial to learn about transforming data by using an Apache Spark cluster on Azure.