Tutorial: Build your first Azure data factory using Azure portal

In this article, you learn how to use the Azure portal to create your first Azure data factory. To do the tutorial by using other tools/SDKs, select one of the options from the drop-down list.

The pipeline in this tutorial has one activity: an HDInsight Hive activity. This activity runs a Hive script on an Azure HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times.

Note

The data pipeline in this tutorial transforms input data to produce output data. For a tutorial on how to copy data using Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.

A pipeline can have more than one activity. You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution in Data Factory.
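The chaining rule above can be sketched with plain JSON-style structures. All activity and dataset names here are hypothetical, chosen only to illustrate the mechanism:

```python
# Sketch (hypothetical names): chain two activities by making the
# output dataset of the first the input dataset of the second.
copy_activity = {
    "name": "CopyRawLogs",
    "type": "Copy",
    "inputs": [{"name": "RawLogsDataset"}],
    "outputs": [{"name": "StagedLogsDataset"}],
}
hive_activity = {
    "name": "TransformLogs",
    "type": "HDInsightHive",
    "inputs": [{"name": "StagedLogsDataset"}],  # same dataset links the two
    "outputs": [{"name": "PartitionedLogsDataset"}],
}

# Data Factory runs TransformLogs only after CopyRawLogs has produced
# a slice of StagedLogsDataset.
chained = copy_activity["outputs"][0]["name"] == hive_activity["inputs"][0]["name"]
print(chained)  # True
```

The shared dataset name is what establishes the dependency; there is no explicit "depends on" field in this model.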

Prerequisites

  1. Read through the Tutorial Overview article and complete the prerequisite steps.
  2. This article does not provide a conceptual overview of the Azure Data Factory service. We recommend that you go through the Introduction to Azure Data Factory article for a detailed overview of the service.

Create data factory

A data factory can have one or more pipelines. A pipeline can have one or more activities in it: for example, a Copy activity to copy data from a source to a destination data store, and an HDInsight Hive activity to run a Hive script that transforms input data to produce output data. In this step, you start by creating the data factory.

  1. Log in to the Azure portal.
  2. Click NEW on the left menu, click Data + Analytics, and click Data Factory.

    Create blade

  3. In the New data factory blade, enter GetStartedDF for the Name.

    New data factory blade

    Important

    The name of the Azure data factory must be globally unique. If you receive the error Data factory name “GetStartedDF” is not available, change the name of the data factory (for example, yournameGetStartedDF) and try creating it again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.

    The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.

  4. Select the Azure subscription where you want the data factory to be created.
  5. Select an existing resource group or create a new one. For this tutorial, create a resource group named ADFGetStartedRG.
  6. Select the location for the data factory. Only regions supported by the Data Factory service are shown in the drop-down list.
  7. Select Pin to dashboard.
  8. Click Create on the New data factory blade.

    Important

    To create Data Factory instances, you must be a member of the Data Factory Contributor role at the subscription/resource group level.

  9. On the dashboard, you see the following tile with status: Deploying data factory.

    Creating data factory status

  10. Congratulations! You have successfully created your first data factory. After the data factory is created, you see the data factory page, which shows you the contents of the data factory.

    Data Factory blade
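The naming rules called out in step 3 can be pre-checked locally with a rough sketch. The regex below encodes the commonly documented constraints (3-63 characters; letters, digits, and hyphens; starting with a letter or digit), but verify against the Data Factory - Naming Rules topic, and note that global uniqueness can only be confirmed by the service itself:

```python
import re

# Rough local pre-check of a data factory name. This is an assumption-based
# sketch of the documented rules, not an authoritative validator: global
# uniqueness is only verified when the service tries to create the factory.
NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]{2,62}$")

def looks_valid(name: str) -> bool:
    return bool(NAME_RE.match(name))

print(looks_valid("yournameGetStartedDF"))  # True
print(looks_valid("-bad-name"))             # False (starts with a hyphen)
```

Treat a passing result as "worth submitting", not as a guarantee the name is available.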

Before creating a pipeline in the data factory, you need to create a few Data Factory entities first. You first create linked services to link data stores/computes to your data factory, then define input and output datasets to represent input/output data in the linked data stores, and finally create the pipeline with an activity that uses these datasets.

Create linked services

In this step, you link your Azure Storage account and an on-demand Azure HDInsight cluster to your data factory. The Azure Storage account holds the input and output data for the pipeline in this sample. The HDInsight linked service is used to run a Hive script specified in the activity of the pipeline in this sample. Identify what data store/compute services are used in your scenario and link those services to the data factory by creating linked services.

Create Azure Storage linked service

In this step, you link your Azure Storage account to your data factory. In this tutorial, you use the same Azure Storage account to store input/output data and the HQL script file.

  1. Click Author and deploy on the DATA FACTORY blade for GetStartedDF. You should see the Data Factory Editor.

    Author and deploy tile

  2. Click New data store and choose Azure storage.

    New data store - Azure Storage - menu

  3. You should see the JSON script for creating an Azure Storage linked service in the editor.

    Azure Storage linked service

  4. Replace account name with the name of your Azure storage account and account key with the access key of the Azure storage account. To learn how to get your storage access key, see the information about how to view, copy, and regenerate storage access keys in Manage your storage account.
  5. Click Deploy on the command bar to deploy the linked service.

    Deploy button

    After the linked service is deployed successfully, the Draft-1 window should disappear and you see AzureStorageLinkedService in the tree view on the left.

    Storage Linked Service in menu

Create Azure HDInsight linked service

In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically created at runtime, and is deleted after it is done processing and has been idle for the specified amount of time.

  1. In the Data Factory Editor, click ... More, click New compute, and select On-demand HDInsight cluster.

    New compute

  2. Copy and paste the following snippet to the Draft-1 window. The JSON snippet describes the properties that are used to create the HDInsight cluster on-demand.

    {
        "name": "HDInsightOnDemandLinkedService",
        "properties": {
            "type": "HDInsightOnDemand",
            "typeProperties": {
                "version": "3.5",
                "clusterSize": 1,
                "timeToLive": "00:05:00",
                "osType": "Linux",
                "linkedServiceName": "AzureStorageLinkedService"
            }
        }
    }
    

    The following table provides descriptions for the JSON properties used in the snippet:

    Property Description
    clusterSize Specifies the size of the HDInsight cluster.
    timeToLive Specifies the idle time for the HDInsight cluster before it is deleted.
    linkedServiceName Specifies the storage account that is used to store the logs that are generated by HDInsight.

    Note the following points:

    • With the preceding JSON, the Data Factory service creates a Linux-based HDInsight cluster for you. See On-demand HDInsight Linked Service for details.
    • You could use your own HDInsight cluster instead of an on-demand HDInsight cluster. See HDInsight Linked Service for details.
    • The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With the on-demand HDInsight linked service, an HDInsight cluster is created every time a slice is processed, unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the processing is done.

      As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: "adfyourdatafactoryname-linkedservicename-datetimestamp". Use tools such as Microsoft Azure Storage Explorer to delete containers in your Azure blob storage.

      See On-demand HDInsight Linked Service for details.

  3. Click Deploy on the command bar to deploy the linked service.

    Deploy on-demand HDInsight linked service

  4. Confirm that you see both AzureStorageLinkedService and HDInsightOnDemandLinkedService in the tree view on the left.

    Tree view with linked services
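If you later want to clean up the leftover on-demand cluster containers described in the note above, you could filter them by the documented name pattern before deleting them. A minimal sketch, with hypothetical container names:

```python
# Sketch: pick out leftover on-demand HDInsight containers by the
# "adf<datafactoryname>-<linkedservicename>-<datetimestamp>" pattern.
# The container names in the example list below are hypothetical.
def leftover_containers(containers, factory, linked_service):
    prefix = f"adf{factory}-{linked_service}-".lower()
    return [c for c in containers if c.lower().startswith(prefix)]

containers = [
    "adfgetstarteddf-hdinsightondemandlinkedservice-2017070100",
    "adfgetstarted",   # tutorial data container - keep this one
    "somethingelse",
]
matches = leftover_containers(
    containers, "GetStartedDF", "HDInsightOnDemandLinkedService"
)
print(matches)
```

Filtering by the full "adf…-…-" prefix, rather than just "adf", avoids accidentally matching the adfgetstarted container that holds your input and output data.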

Create datasets

In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to the AzureStorageLinkedService you created earlier in this tutorial. The linked service points to an Azure Storage account, and the datasets specify the container, folder, and file name in the storage account that holds the input and output data.

Create input dataset

  1. In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob storage.

    New dataset

  2. Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you are creating a dataset called AzureBlobInput that represents input data for an activity in the pipeline. In addition, you specify that the input data is located in the blob container called adfgetstarted and the folder called inputdata.

    {
        "name": "AzureBlobInput",
        "properties": {
            "type": "AzureBlob",
            "linkedServiceName": "AzureStorageLinkedService",
            "typeProperties": {
                "fileName": "input.log",
                "folderPath": "adfgetstarted/inputdata",
                "format": {
                    "type": "TextFormat",
                    "columnDelimiter": ","
                }
            },
            "availability": {
                "frequency": "Month",
                "interval": 1
            },
            "external": true,
            "policy": {}
        }
    }
    

    The following table provides descriptions for the JSON properties used in the snippet:

    Property Description
    type The type property is set to AzureBlob because the data resides in Azure blob storage.
    linkedServiceName Refers to the AzureStorageLinkedService you created earlier.
    folderPath Specifies the blob container and the folder that contains the input blobs.
    fileName This property is optional. If you omit this property, all the files in the folderPath are picked up. In this tutorial, only input.log is processed.
    type (under format) The log files are in text format, so we use TextFormat.
    columnDelimiter Columns in the log files are delimited by the comma character (,).
    frequency/interval frequency is set to Month and interval is 1, which means that the input slices are available monthly.
    external This property is set to true if the input data is not generated by the pipeline. In this tutorial, the input.log file is not generated by the pipeline, so we set the property to true.

    For more information about these JSON properties, see Azure Blob connector article.

  3. Click Deploy on the command bar to deploy the newly created dataset. You should see the dataset in the tree view on the left.
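As a quick sketch of how the dataset's folderPath and fileName resolve to a container and a blob path (the first path segment is the container; the rest is the folder within it):

```python
# Sketch: resolve a dataset's folderPath (+ optional fileName) into the
# blob container and blob path, as described in the property table above.
def resolve_blob(folder_path: str, file_name: str = ""):
    container, _, folder = folder_path.partition("/")
    blob = f"{folder}/{file_name}" if file_name else folder
    return container, blob

print(resolve_blob("adfgetstarted/inputdata", "input.log"))
# ('adfgetstarted', 'inputdata/input.log')
```

With fileName omitted, every blob under the folder would be picked up, which matches the fileName description in the table.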

Create output dataset

Now, you create the output dataset to represent the output data stored in the Azure Blob storage.

  1. In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob storage.
  2. Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you are creating a dataset called AzureBlobOutput, and specifying the structure of the data that is produced by the Hive script. In addition, you specify that the results are stored in the blob container called adfgetstarted and the folder called partitioneddata. The availability section specifies that the output dataset is produced on a monthly basis.

    {
      "name": "AzureBlobOutput",
      "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
          "folderPath": "adfgetstarted/partitioneddata",
          "format": {
            "type": "TextFormat",
            "columnDelimiter": ","
          }
        },
        "availability": {
          "frequency": "Month",
          "interval": 1
        }
      }
    }
    

    See the Create input dataset section for descriptions of these properties. You do not set the external property on an output dataset, because the dataset is produced by the Data Factory service.

  3. Click Deploy on the command bar to deploy the newly created dataset.
  4. Verify that the dataset is created successfully.

    Tree view with datasets

Create pipeline

In this step, you create your first pipeline with an HDInsight Hive activity. The input slice is available monthly (frequency: Month, interval: 1), the output slice is produced monthly, and the scheduler property for the activity is also set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, the output dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset. The properties used in the following JSON are explained at the end of this section.
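The monthly slicing can be sketched as follows. This mirrors, but does not reproduce, the service's scheduler: it simply enumerates calendar-month slice windows that start within the pipeline's active period (the start/end values match the pipeline JSON used later in this section):

```python
from datetime import datetime

def monthly_slices(start: datetime, end: datetime):
    """Yield (slice_start, slice_end) pairs for frequency=Month, interval=1.

    Rough sketch only: assumes slice boundaries fall on the first of the
    month, as in this tutorial's start/end values.
    """
    current = start
    while current < end:
        # Advance one calendar month.
        year = current.year + (current.month // 12)
        month = current.month % 12 + 1
        nxt = current.replace(year=year, month=month)
        yield current, min(nxt, end)
        current = nxt

slices = list(monthly_slices(datetime(2017, 7, 1), datetime(2017, 7, 2)))
print(len(slices))  # 1 -> the pipeline produces exactly one monthly slice
```

Because the active period 2017-07-01 to 2017-07-02 contains exactly one monthly boundary, the activity runs once, which is why the tutorial's pipeline executes a single slice immediately after deployment.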

  1. In the Data Factory Editor, click Ellipsis (…) More commands and then click New pipeline.

    new pipeline button

  2. Copy and paste the following snippet to the Draft-1 window.

    Important

    Replace storageaccountname with the name of your storage account in the JSON.

    {
        "name": "MyFirstPipeline",
        "properties": {
            "description": "My first Azure Data Factory pipeline",
            "activities": [
                {
                    "type": "HDInsightHive",
                    "typeProperties": {
                        "scriptPath": "adfgetstarted/script/partitionweblogs.hql",
                        "scriptLinkedService": "AzureStorageLinkedService",
                        "defines": {
                            "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
                            "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
                        }
                    },
                    "inputs": [
                        {
                            "name": "AzureBlobInput"
                        }
                    ],
                    "outputs": [
                        {
                            "name": "AzureBlobOutput"
                        }
                    ],
                    "policy": {
                        "concurrency": 1,
                        "retry": 3
                    },
                    "scheduler": {
                        "frequency": "Month",
                        "interval": 1
                    },
                    "name": "RunSampleHiveActivity",
                    "linkedServiceName": "HDInsightOnDemandLinkedService"
                }
            ],
            "start": "2017-07-01T00:00:00Z",
            "end": "2017-07-02T00:00:00Z",
            "isPaused": false
        }
    }
    

    In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process data on an HDInsight cluster.

    The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by scriptLinkedService, which is AzureStorageLinkedService), in the script folder of the adfgetstarted container.

    The defines section is used to specify the runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable} and ${hiveconf:partitionedtable}).

    The start and end properties of the pipeline specify the active period of the pipeline.

    In the activity JSON, you specify that the Hive script runs on the compute specified by linkedServiceName: HDInsightOnDemandLinkedService.

    Note

    See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about JSON properties used in the example.

  3. Confirm the following:

    1. The input.log file exists in the inputdata folder of the adfgetstarted container in your Azure blob storage.
    2. The partitionweblogs.hql file exists in the script folder of the adfgetstarted container in your Azure blob storage. If you don't see these files, complete the prerequisite steps in the Tutorial Overview.
    3. You replaced storageaccountname with the name of your storage account in the pipeline JSON.
  4. Click Deploy on the command bar to deploy the pipeline. Since the start and end times are set in the past and isPaused is set to false, the pipeline (activity in the pipeline) runs immediately after you deploy.
  5. Confirm that you see the pipeline in the tree view.

    Tree view with pipeline

  6. Congratulations, you have successfully created your first pipeline!
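Before moving on to monitoring, here is a sketch of how the defines entries in the pipeline JSON resolve to wasb:// URIs and reach the Hive script. The <storageaccountname> placeholder is kept as-is, exactly as in the pipeline JSON:

```python
# Sketch: build the wasb:// URIs used in the defines section, and show
# how they surface inside the Hive script as ${hiveconf:...} references.
def wasb_uri(container: str, account: str, path: str) -> str:
    return f"wasb://{container}@{account}.blob.core.windows.net/{path}"

defines = {
    "inputtable": wasb_uri("adfgetstarted", "<storageaccountname>", "inputdata"),
    "partitionedtable": wasb_uri("adfgetstarted", "<storageaccountname>", "partitioneddata"),
}

# In partitionweblogs.hql these values are read back as, for example:
#   ${hiveconf:inputtable}   and   ${hiveconf:partitionedtable}
print(defines["inputtable"])
# wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata
```

The wasb scheme addresses a blob container (adfgetstarted) in a storage account, with the remainder of the URI being the folder path within that container.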

Monitor pipeline

Monitor pipeline using Diagram View

  1. Click X to close Data Factory Editor blades and to navigate back to the Data Factory blade, and click Diagram.

    Diagram tile

  2. In the Diagram View, you see an overview of the pipelines, and datasets used in this tutorial.

    Diagram View

  3. To view all activities in the pipeline, right-click pipeline in the diagram and click Open Pipeline.

    Open pipeline menu

  4. Confirm that you see the HDInsightHive activity in the pipeline.

    Open pipeline view

    To navigate back to the previous view, click Data factory in the breadcrumb menu at the top.

  5. In the Diagram View, double-click the dataset AzureBlobInput. Confirm that the slice is in Ready state. It may take a couple of minutes for the slice to show up in Ready state. If it does not happen after you wait for some time, check that you have placed the input file (input.log) in the right container (adfgetstarted) and folder (inputdata).

    Input slice in ready state

  6. Click X to close AzureBlobInput blade.
  7. In the Diagram View, double-click the dataset AzureBlobOutput. You see the slice that is currently being processed.

    Dataset

  8. When processing is done, you see the slice in Ready state.

    Dataset

    Important

    Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the pipeline to take approximately 30 minutes to process the slice.

  9. When the slice is in Ready state, check the partitioneddata folder in the adfgetstarted container in your blob storage for the output data.

    output data

  10. Click the slice to see details about it in a Data slice blade.

    Data slice details

  11. Click an activity run in the Activity runs list to see details about an activity run (Hive activity in our scenario) in an Activity run details window.

    Activity run details

    From the log files, you can see the Hive query that was executed and its status information. These logs are useful for troubleshooting any issues. For more details, see the Monitor and manage pipelines using Azure portal blades article.

Important

The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.

Monitor pipeline using Monitor & Manage App

You can also use the Monitor & Manage application to monitor your pipelines. For detailed information about using this application, see Monitor and manage Azure Data Factory pipelines using Monitoring and Management App.

  1. Click the Monitor & Manage tile on the home page for your data factory.

    Monitor & Manage tile

  2. You should see the Monitor & Manage application. Change the Start time and End time to match the start and end times of your pipeline, and click Apply.

    Monitor & Manage App

  3. Select an activity window in the Activity Windows list to see details about it.

    Activity window details

Summary

In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop cluster. You used the Data Factory Editor in the Azure portal to do the following steps:

  1. Created an Azure data factory.
  2. Created two linked services:
    1. Azure Storage linked service to link your Azure blob storage that holds input/output files to the data factory.
    2. An Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the data factory. Azure Data Factory creates an HDInsight Hadoop cluster just-in-time to process the input data and produce output data.
  3. Created two datasets, which describe the input and output data for the HDInsight Hive activity in the pipeline.
  4. Created a pipeline with an HDInsight Hive activity.

Next Steps

In this article, you created a pipeline with a transformation activity (HDInsight Hive activity) that runs a Hive script on an on-demand HDInsight cluster. To see how to use a Copy activity to copy data from an Azure blob to Azure SQL, see Tutorial: Copy data from an Azure blob to Azure SQL.

See Also

Topic Description
Pipelines This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct end-to-end data-driven workflows for your scenario or business.
Datasets This article helps you understand datasets in Azure Data Factory.
Scheduling and execution This article explains the scheduling and execution aspects of the Azure Data Factory application model.
Monitor and manage pipelines using Monitoring App This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App.