Create a data factory by using the Azure Data Factory UI

This quickstart describes how to use the Azure Data Factory UI to create and monitor a data factory. The pipeline that you create in this data factory copies data from one folder to another folder in Azure Blob storage. For a tutorial on how to transform data by using Azure Data Factory, see Tutorial: Transform data by using Spark.

Note

If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.

This article applies to version 2 of Data Factory, which is currently in preview. If you are using version 1 of the service, which is in general availability (GA), see the Data Factory version 1 tutorial.

Prerequisites

Azure subscription

If you don't have an Azure subscription, create a free account before you begin.

Azure roles

To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the Contributor or Owner role, or an administrator of the Azure subscription. In the Azure portal, select your username in the upper-right corner, and then select Permissions to view the permissions that you have in the subscription. If you have access to multiple subscriptions, select the appropriate subscription. For sample instructions on adding a user to a role, see the Add roles article.

Azure storage account

You use a general-purpose Azure storage account (specifically, Blob storage) as both the source and destination data store in this quickstart. If you don't have a general-purpose Azure storage account, see Create a storage account to create one.

Get the storage account name and account key

You use the name and key of your Azure storage account in this quickstart. The following procedure shows how to get them:

  1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
  2. Select More services on the left menu, filter with the Storage keyword, and select Storage accounts.

    Search for a storage account

  3. In the list of storage accounts, filter for your storage account (if needed), and then select your storage account.
  4. On the Storage account page, select Access keys on the menu.

    Get storage account name and key

  5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them into Notepad or any other editor and save the file. You use these values later in this quickstart.
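
If you ever need the two values together, they are conventionally combined into an Azure Storage connection string of the following form. This is shown only for reference; in this quickstart, the Data Factory UI lets you select the storage account by name instead of entering a connection string.

    DefaultEndpointsProtocol=https;AccountName=<storage account name>;AccountKey=<key1>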

Create the input folder and files

In this section, you create a blob container named adftutorial in Azure Blob storage. You create a folder named input in the container, and then upload a sample file to the input folder.

  1. On the Storage account page, switch to Overview, and then select Blobs.

    Select Blobs option

  2. On the Blob service page, select + Container on the toolbar.

    Add container button

  3. In the New container dialog box, enter adftutorial for the name, and then select OK.

    Enter container name

  4. Select adftutorial in the list of containers.

    Select the container

  5. On the Container page, select Upload on the toolbar.

    Upload button

  6. On the Upload blob page, select Advanced.

    Select Advanced link

  7. Start Notepad and create a file named emp.txt with the following content. Save it in the c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not already exist.

    John, Doe
    Jane, Doe
    
  8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the Files box.
  9. Enter input as a value for the Upload to folder box.

    Upload blob settings

  10. Confirm that the folder is input and the file is emp.txt, and select Upload.

    You should see the emp.txt file and the status of the upload in the list.

  11. Close the Upload blob page by selecting the X in the corner.

    Close upload blob page

  12. Keep the Container page open. You use it to verify the output at the end of this quickstart.

Video

Watching this video helps you understand the Data Factory UI:

Create a data factory

  1. Open the Microsoft Edge or Google Chrome web browser. Currently, the Data Factory UI is supported only in the Microsoft Edge and Google Chrome web browsers.
  2. Go to the Azure portal.
  3. Select New on the left menu, select Data + Analytics, and then select Data Factory.

    Data Factory selection in the "New" pane

  4. On the New data factory page, enter ADFTutorialDataFactory for Name.

    "New data factory" page

    The name of the Azure data factory must be globally unique. If you see the following error, change the name of the data factory (for example, <yourname>ADFTutorialDataFactory) and try creating it again. For naming rules for Data Factory artifacts, see the Data Factory - naming rules article.

    Error when a name is not available

  5. For Subscription, select the Azure subscription in which you want to create the data factory.
  6. For Resource Group, take one of the following steps:

    • Select Use existing, and select an existing resource group from the list.
    • Select Create new, and enter the name of a resource group.

    To learn about resource groups, see Using resource groups to manage your Azure resources.

  7. For Version, select V2 (Preview).
  8. For Location, select the location for the data factory.

    The list shows only locations that Data Factory supports. The data stores (like Azure Storage and Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can be in other locations.

  9. Select Pin to dashboard.
  10. Select Create.
  11. On the dashboard, you see the following tile with the status Deploying Data Factory:

    "Deploying Data Factory" tile

  12. After the creation is complete, you see the Data Factory page. Select the Author & Monitor tile to start the Azure Data Factory user interface (UI) application on a separate tab.

    Home page for the data factory, with the "Author & Monitor" tile

  13. On the Let's get started page, switch to the Edit tab in the left panel.

    "Let's get started" page

Create a linked service

In this procedure, you create a linked service to link your Azure storage account to the data factory. The linked service has the connection information that the Data Factory service uses at runtime to connect to the storage account. A sketch of the JSON definition that results from this procedure appears after the steps.

  1. Select Connections, and then select the New button on the toolbar.

    Buttons for creating a new connection

  2. On the New Linked Service page, select Azure Blob Storage, and then select Continue.

    Selecting the "Azure Blob Storage" tile

  3. Complete the following steps:

    a. For Name, enter AzureStorageLinkedService.

    b. For Storage account name, select the name of your Azure storage account.

    c. Select Test connection to confirm that the Data Factory service can connect to the storage account.

    d. Select Save to save the linked service.

    Azure Storage linked service settings

  4. Confirm that you see AzureStorageLinkedService in the list of linked services.

    Azure Storage linked service in the list
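
For reference, a linked service like this one is represented by a JSON definition along the lines of the following minimal sketch. The property names shown are indicative and can differ by service version, and the connection string value is a placeholder built from the storage account name and key that you saved earlier; the UI generates the actual definition for you.

    {
        "name": "AzureStorageLinkedService",
        "properties": {
            "type": "AzureStorage",
            "typeProperties": {
                "connectionString": {
                    "type": "SecureString",
                    "value": "DefaultEndpointsProtocol=https;AccountName=<storage account name>;AccountKey=<key1>"
                }
            }
        }
    }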

Create datasets

In this procedure, you create two datasets: InputDataset and OutputDataset. These datasets are of type AzureBlob. They refer to the Azure Storage linked service that you created in the previous section.

The input dataset represents the source data in the input folder. In the input dataset definition, you specify the blob container (adftutorial), the folder (input), and the file (emp.txt) that contain the source data.

The output dataset represents the data that's copied to the destination. In the output dataset definition, you specify the blob container (adftutorial), the folder (output), and the file to which the data is copied. Each run of a pipeline has a unique ID associated with it. You can access this ID by using the system variable RunId. The name of the output file is dynamically evaluated based on the run ID of the pipeline.

In the linked service settings, you specified the Azure storage account that contains the source data. In the source dataset settings, you specify where exactly the source data resides (blob container, folder, and file). In the sink dataset settings, you specify where the data is copied to (blob container, folder, and file).

  1. Select the + (plus) button, and then select Dataset.

    Menu for creating a dataset

  2. On the New Dataset page, select Azure Blob Storage, and then select Finish.

    Selecting "Azure Blob Storage"

  3. In the Properties window for the dataset, enter InputDataset for Name.

    Dataset general settings

  4. Switch to the Connection tab and complete the following steps:

    a. For Linked service, select AzureStorageLinkedService.

    b. For File path, select the Browse button.

    "Connection" tab and "Browse" button

    c. In the Choose a file or folder window, browse to the input folder in the adftutorial container, select the emp.txt file, and then select Finish.

    Browse for the input file

    d. (optional) Select Preview data to preview the data in the emp.txt file.

  5. Repeat the steps to create the output dataset:

    a. Select the + (plus) button, and then select Dataset.

    b. On the New Dataset page, select Azure Blob Storage, and then select Finish.

    c. Specify OutputDataset for the name.

    d. Enter adftutorial/output for the folder. If the output folder does not exist, the copy activity creates it at runtime.

    e. Enter @CONCAT(pipeline().RunId, '.txt') for the file name.

    Each time you run a pipeline, the pipeline run has a unique ID associated with it. The expression concatenates the run ID of the pipeline with .txt to evaluate the output file name. For the supported list of system variables and expressions, see System variables and Expression language.

    Output dataset settings
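
For reference, the two datasets are represented by JSON definitions along the lines of the following minimal sketches. Property names are indicative and can differ by service version; the UI generates the actual definitions. Note that the output file name uses the same @CONCAT(pipeline().RunId, '.txt') expression that you entered in the UI.

    {
        "name": "InputDataset",
        "properties": {
            "type": "AzureBlob",
            "linkedServiceName": {
                "referenceName": "AzureStorageLinkedService",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "folderPath": "adftutorial/input",
                "fileName": "emp.txt"
            }
        }
    }

    {
        "name": "OutputDataset",
        "properties": {
            "type": "AzureBlob",
            "linkedServiceName": {
                "referenceName": "AzureStorageLinkedService",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "folderPath": "adftutorial/output",
                "fileName": {
                    "value": "@CONCAT(pipeline().RunId, '.txt')",
                    "type": "Expression"
                }
            }
        }
    }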

Create a pipeline

In this procedure, you create and validate a pipeline with a copy activity that uses the input and output datasets. The copy activity copies data from the file you specified in the input dataset settings to the file you specified in the output dataset settings. If the input dataset specifies only a folder (not the file name), the copy activity copies all the files in the source folder to the destination.

  1. Select the + (plus) button, and then select Pipeline.

    Menu for creating a new pipeline

  2. In the Properties window, specify CopyPipeline for Name.

    Pipeline general settings

  3. In the Activities toolbox, expand Data Flow. Drag the Copy activity from the Activities toolbox to the pipeline designer surface. You can also search for activities in the Activities toolbox. Specify CopyFromBlobToBlob for Name.

    Copy activity general settings

  4. Switch to the Source tab in the copy activity settings, and select InputDataset for Source Dataset.

    Copy activity source settings

  5. Switch to the Sink tab in the copy activity settings, and select OutputDataset for Sink Dataset.

    Copy activity sink settings

  6. Select Validate to validate the pipeline settings. Confirm that the pipeline has been successfully validated. To close the validation output, select the >> (right arrow) button.

    Validate the pipeline
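
For reference, the pipeline that you just built, with its single copy activity reading from InputDataset and writing to OutputDataset, is represented by a JSON definition along the lines of the following minimal sketch. Property names are indicative and can differ by service version; the UI generates the actual definition.

    {
        "name": "CopyPipeline",
        "properties": {
            "activities": [
                {
                    "name": "CopyFromBlobToBlob",
                    "type": "Copy",
                    "inputs": [
                        {
                            "referenceName": "InputDataset",
                            "type": "DatasetReference"
                        }
                    ],
                    "outputs": [
                        {
                            "referenceName": "OutputDataset",
                            "type": "DatasetReference"
                        }
                    ],
                    "typeProperties": {
                        "source": { "type": "BlobSource" },
                        "sink": { "type": "BlobSink" }
                    }
                }
            ]
        }
    }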

Test run the pipeline

In this step, you do a test run of the pipeline before deploying it to Data Factory.

  1. On the toolbar for the pipeline, select Test Run.

    Pipeline test runs

  2. Confirm that you see the status of the pipeline run on the Output tab of the pipeline settings.

    Test run output

  3. Confirm that you see an output file in the output folder of the adftutorial container. If the output folder does not exist, the Data Factory service automatically creates it.

    Verify output

Trigger the pipeline manually

In this procedure, you deploy entities (linked services, datasets, pipelines) to Azure Data Factory. Then, you manually trigger a pipeline run. You can also publish entities to your own Visual Studio Team Services Git repository, which is covered in another tutorial.

  1. Before you trigger a pipeline, you must publish entities to Data Factory. To publish, select Publish All in the left pane.

    Publish button

  2. To trigger the pipeline manually, select Trigger on the toolbar, and then select Trigger Now.

    "Trigger Now" command

Monitor the pipeline

  1. Switch to the Monitor tab on the left. Use the Refresh button to refresh the list.

    Tab for monitoring pipeline runs, with "Refresh" button

  2. Select the View Activity Runs link under Actions. You see the status of the copy activity run on this page.

    Pipeline activity runs

  3. To view details about the copy operation, select the Details (eyeglasses icon) link in the Actions column. For details about the properties, see the Copy Activity overview.

    Copy operation details

  4. Confirm that you see a new file in the output folder.
  5. You can switch back to the Pipeline Runs view from the Activity Runs view by selecting the Pipelines link.

Trigger the pipeline on a schedule

This procedure is optional in this tutorial. You can create a schedule trigger to run the pipeline periodically (hourly, daily, and so on). In this procedure, you create a trigger that runs every minute until the end date and time that you specify.

  1. Switch to the Edit tab.

    Edit button

  2. Select Trigger on the menu, and then select New/Edit.

    Menu for new trigger

  3. On the Add Triggers page, select Choose trigger, and then select New.

    Selections for adding a new trigger

  4. On the New Trigger page, under End, select On Date, specify an end time a few minutes after the current time, and then select Apply.

    Because a cost is associated with each pipeline run, specify an end time that is only a few minutes after the start time, on the same day. However, make sure that there is enough time between the publish time and the end time for the pipeline to run. The trigger takes effect only after you publish the solution to Data Factory, not when you save the trigger in the UI.

    Trigger settings

  5. On the New Trigger page, select the Activated check box, and then select Next.

    "Activated" check box and "Next" button

  6. Review the warning message, and select Finish.

    Warning and "Finish" button

  7. Select Publish All to publish changes to Data Factory.

    Publish button

  8. Switch to the Monitor tab on the left. Select Refresh to refresh the list. You see that the pipeline runs once every minute from the publish time to the end time.

    Notice the values in the Triggered By column. The manual trigger run was from the Trigger Now step that you performed earlier.

    List of triggered runs

  9. Select the down arrow next to Pipeline Runs to switch to the Trigger Runs view.

    Switch to "Trigger Runs" view

  10. Confirm that an output file is created in the output folder for every pipeline run until the specified end date and time.
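
For reference, a schedule trigger like the one in this procedure is represented by a JSON definition along the lines of the following minimal sketch. The trigger name, start time, and end time shown are placeholders, and property names are indicative and can differ by service version; the UI generates the actual definition.

    {
        "name": "<trigger name>",
        "properties": {
            "type": "ScheduleTrigger",
            "typeProperties": {
                "recurrence": {
                    "frequency": "Minute",
                    "interval": 1,
                    "startTime": "<start time, for example 2018-04-01T00:00:00Z>",
                    "endTime": "<end time a few minutes later>",
                    "timeZone": "UTC"
                }
            },
            "pipelines": [
                {
                    "pipelineReference": {
                        "referenceName": "CopyPipeline",
                        "type": "PipelineReference"
                    }
                }
            ]
        }
    }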

Next steps

The pipeline in this sample copies data from one location to another location in Azure Blob storage. To learn about using Data Factory in more scenarios, go through the tutorials.