Tutorial: Build your first Azure data factory using Azure PowerShell

In this article, you use Azure PowerShell to create your first Azure data factory. To do the tutorial using other tools/SDKs, select one of the options from the drop-down list.

The pipeline in this tutorial has one activity: an HDInsight Hive activity. This activity runs a Hive script on an Azure HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times.

Note

The data pipeline in this tutorial transforms input data to produce output data. It does not copy data from a source data store to a destination data store. For a tutorial on how to copy data using Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.

A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution in Data Factory.

Prerequisites

Create data factory

In this step, you use Azure PowerShell to create an Azure Data Factory named FirstDataFactoryPSH. A data factory can have one or more pipelines. A pipeline can have one or more activities in it; for example, a Copy Activity to copy data from a source to a destination data store, and an HDInsight Hive activity to run a Hive script to transform input data. Let's start by creating the data factory in this step.

  1. Start Azure PowerShell and run the following command. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen, you need to run these commands again.
    • Run the following command and enter the user name and password that you use to sign in to the Azure portal:

      Login-AzureRmAccount

    • Run the following command to view all the subscriptions for this account:

      Get-AzureRmSubscription

    • Run the following command to select the subscription that you want to work with. This subscription should be the same as the one you used in the Azure portal:

      Get-AzureRmSubscription -SubscriptionName <SUBSCRIPTION NAME> | Set-AzureRmContext
  2. Create an Azure resource group named ADFTutorialResourceGroup by running the following command:

    New-AzureRmResourceGroup -Name ADFTutorialResourceGroup  -Location "West US"
    

    Some of the steps in this tutorial assume that you use the resource group named ADFTutorialResourceGroup. If you use a different resource group, you need to use it in place of ADFTutorialResourceGroup in this tutorial.

  3. Run the New-AzureRmDataFactory cmdlet that creates a data factory named FirstDataFactoryPSH.

    New-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH -Location "West US"
    

    Note the following points:

  • The name of the Azure Data Factory must be globally unique. If you receive the error "Data factory name FirstDataFactoryPSH is not available", change the name (for example, yournameFirstDataFactoryPSH). Use this name in place of FirstDataFactoryPSH while performing the steps in this tutorial. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
  • To create Data Factory instances, you need to be a contributor or administrator of the Azure subscription.
  • The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.
  • If you receive the error: "This subscription is not registered to use namespace Microsoft.DataFactory", do one of the following and try publishing again:

    • In Azure PowerShell, run the following command to register the Data Factory provider:

      Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory
      

      You can run the following command to confirm that the Data Factory provider is registered:

      Get-AzureRmResourceProvider
      
    • Sign in to the Azure portal with the Azure subscription and navigate to a Data Factory blade, or create a data factory in the Azure portal. This action automatically registers the provider for you.
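
If you want to combine the register and verify commands shown above, the following is a minimal PowerShell sketch that registers the Microsoft.DataFactory provider and then prints its registration state so you can confirm it shows Registered:

    # Register the Data Factory resource provider for the selected subscription.
    Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

    # Confirm that the provider now reports RegistrationState = Registered.
    Get-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory |
        Select-Object ProviderNamespace, RegistrationState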

Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to link data stores and compute services to your data factory, then define input and output datasets to represent the input/output data in the linked data stores, and finally create the pipeline with an activity that uses these datasets.

Create linked services

In this step, you link your Azure Storage account and an on-demand Azure HDInsight cluster to your data factory. The Azure Storage account holds the input and output data for the pipeline in this sample. The HDInsight linked service is used to run a Hive script specified in the activity of the pipeline in this sample. Identify what data store/compute services are used in your scenario and link those services to the data factory by creating linked services.

Create Azure Storage linked service

In this step, you link your Azure Storage account to your data factory. You use the same Azure Storage account to store input/output data and the HQL script file.

  1. Create a JSON file named StorageLinkedService.json in the C:\ADFGetStarted folder with the following content. Create the folder ADFGetStarted if it does not already exist.

    {
        "name": "StorageLinkedService",
        "properties": {
            "type": "AzureStorage",
            "description": "",
            "typeProperties": {
                "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
            }
        }
    }
    

    Replace <accountname> with the name of your Azure storage account and <accountkey> with the access key of the Azure storage account. To learn how to get your storage access key, see the information about how to view, copy, and regenerate storage access keys in Manage your storage account. If you prefer to generate this file from PowerShell, a scripted alternative is sketched after this procedure.

  2. In Azure PowerShell, switch to the ADFGetStarted folder.
  3. You can use the New-AzureRmDataFactoryLinkedService cmdlet to create a linked service. This cmdlet and the other Data Factory cmdlets you use in this tutorial require you to pass values for the ResourceGroupName and DataFactoryName parameters. Alternatively, you can use Get-AzureRmDataFactory to get a DataFactory object and pass the object without typing ResourceGroupName and DataFactoryName each time you run a cmdlet. Run the following command to assign the output of the Get-AzureRmDataFactory cmdlet to a $df variable.

    $df=Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH
    
  4. Now, run the New-AzureRmDataFactoryLinkedService cmdlet to create the linked service StorageLinkedService.

    New-AzureRmDataFactoryLinkedService $df -File .\StorageLinkedService.json
    

    If you hadn't run the Get-AzureRmDataFactory cmdlet and assigned the output to the $df variable, you would have to specify values for the ResourceGroupName and DataFactoryName parameters as follows.

    New-AzureRmDataFactoryLinkedService -ResourceGroupName ADFTutorialResourceGroup -DataFactoryName FirstDataFactoryPSH -File .\StorageLinkedService.json
    

    If you close Azure PowerShell in the middle of the tutorial, you have to run the Get-AzureRmDataFactory cmdlet next time you start Azure PowerShell to complete the tutorial.
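
If you prefer to generate StorageLinkedService.json from PowerShell instead of editing it by hand, the following is a minimal sketch. It assumes the AzureRM.Storage module is installed and that your storage account is in the ADFTutorialResourceGroup resource group; the account name is a placeholder, and the shape of the object returned by Get-AzureRmStorageAccountKey can vary between module versions.

    # Placeholder storage account name; assumes the account is in ADFTutorialResourceGroup.
    $storageAccountName = "<yourstorageaccount>"

    # Retrieve the first access key (the key object shape can differ across AzureRM.Storage versions).
    $storageKey = (Get-AzureRmStorageAccountKey -ResourceGroupName ADFTutorialResourceGroup -Name $storageAccountName)[0].Value

    # Build the linked service definition and write it to StorageLinkedService.json.
    $linkedService = @{
        name = "StorageLinkedService"
        properties = @{
            type = "AzureStorage"
            typeProperties = @{
                connectionString = "DefaultEndpointsProtocol=https;AccountName=$storageAccountName;AccountKey=$storageKey"
            }
        }
    }
    $linkedService | ConvertTo-Json -Depth 5 | Set-Content -Path .\StorageLinkedService.json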

Create Azure HDInsight linked service

In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically created at runtime and deleted after it is done processing and idle for the specified amount of time. You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See Compute Linked Services for details.

  1. Create a JSON file named HDInsightOnDemandLinkedService.json in the C:\ADFGetStarted folder with the following content.

    {
        "name": "HDInsightOnDemandLinkedService",
        "properties": {
            "type": "HDInsightOnDemand",
            "typeProperties": {
                "version": "3.5",
                "clusterSize": 1,
                "timeToLive": "00:05:00",
                "osType": "Linux",
                "linkedServiceName": "StorageLinkedService"
            }
        }
    }
    

    The following table provides descriptions for the JSON properties used in the snippet:

    Property Description
    ClusterSize Specifies the size of the HDInsight cluster.
    TimeToLive Specifies the idle time for the HDInsight cluster before it is deleted.
    linkedServiceName Specifies the storage account that is used to store the logs that are generated by HDInsight.

    Note the following points:

    • Based on the JSON, the Data Factory service creates a Linux-based HDInsight cluster for you. See On-demand HDInsight Linked Service for details.
    • You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See HDInsight Linked Service for details.
    • The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With the on-demand HDInsight linked service, an HDInsight cluster is created every time a slice is processed unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the processing is done.

      As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: "adfyourdatafactoryname-linkedservicename-datetimestamp". Use tools such as Microsoft Azure Storage Explorer to delete containers in your Azure blob storage, or see the PowerShell sketch after this list.

      See On-demand HDInsight Linked Service for details.

  2. Run the New-AzureRmDataFactoryLinkedService cmdlet that creates the linked service called HDInsightOnDemandLinkedService.

    New-AzureRmDataFactoryLinkedService $df -File .\HDInsightOnDemandLinkedService.json
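
If you want to clean up the containers that on-demand HDInsight clusters leave behind (mentioned in the notes above), the following is a minimal sketch that uses the Azure.Storage cmdlets. The account name, key, and container-name prefix are placeholders; the prefix follows the "adfyourdatafactoryname-" pattern described earlier, so review the matching containers before deleting them.

    # Placeholder account name and key; build a storage context for the account.
    $ctx = New-AzureStorageContext -StorageAccountName "<yourstorageaccount>" -StorageAccountKey "<yourstoragekey>"

    # List containers created by on-demand HDInsight clusters (placeholder prefix) and delete them.
    Get-AzureStorageContainer -Context $ctx |
        Where-Object { $_.Name -like "adffirstdatafactorypsh-*" } |
        ForEach-Object { Remove-AzureStorageContainer -Name $_.Name -Context $ctx -Force }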
    

Create datasets

In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to the StorageLinkedService you created earlier in this tutorial. The linked service points to an Azure Storage account, and the datasets specify the container, folder, and file name in the storage that holds the input and output data.

Create input dataset

  1. Create a JSON file named InputTable.json in the C:\ADFGetStarted folder with the following content:

    {
        "name": "AzureBlobInput",
        "properties": {
            "type": "AzureBlob",
            "linkedServiceName": "StorageLinkedService",
            "typeProperties": {
                "fileName": "input.log",
                "folderPath": "adfgetstarted/inputdata",
                "format": {
                    "type": "TextFormat",
                    "columnDelimiter": ","
                }
            },
            "availability": {
                "frequency": "Month",
                "interval": 1
            },
            "external": true,
            "policy": {}
        }
    }
    

    The JSON defines a dataset named AzureBlobInput, which represents input data for an activity in the pipeline. In addition, it specifies that the input data is located in the blob container called adfgetstarted and the folder called inputdata.

    The following table provides descriptions for the JSON properties used in the snippet:

    Property Description
    type The type property is set to AzureBlob because data resides in Azure blob storage.
    linkedServiceName Refers to the StorageLinkedService you created earlier.
    fileName This property is optional. If you omit this property, all the files from the folderPath are picked. In this case, only input.log is processed.
    type (under format) The log files are in text format, so we use TextFormat.
    columnDelimiter Columns in the log files are delimited by the comma character (,).
    frequency/interval Frequency is set to Month and interval is 1, which means that the input slices are available monthly.
    external This property is set to true if the input data is not generated by the Data Factory service.
  2. Run the following command in Azure PowerShell to create the Data Factory dataset:

    New-AzureRmDataFactoryDataset $df -File .\InputTable.json
    

Create output dataset

Now, you create the output dataset to represent the output data stored in the Azure Blob storage.

  1. Create a JSON file named OutputTable.json in the C:\ADFGetStarted folder with the following content:

    {
      "name": "AzureBlobOutput",
      "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
          "folderPath": "adfgetstarted/partitioneddata",
          "format": {
            "type": "TextFormat",
            "columnDelimiter": ","
          }
        },
        "availability": {
          "frequency": "Month",
          "interval": 1
        }
      }
    }
    

    The JSON defines a dataset named AzureBlobOutput, which represents output data for an activity in the pipeline. In addition, it specifies that the results are stored in the blob container called adfgetstarted and the folder called partitioneddata. The availability section specifies that the output dataset is produced on a monthly basis.

  2. Run the following command in Azure PowerShell to create the Data Factory dataset:

    New-AzureRmDataFactoryDataset $df -File .\OutputTable.json
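
If you want to confirm that both datasets were created, a minimal check is to list the datasets in the data factory with the Get-AzureRmDataFactoryDataset cmdlet; AzureBlobInput and AzureBlobOutput should appear in the output.

    # List the datasets in the data factory (uses the $df object assigned earlier).
    Get-AzureRmDataFactoryDataset $df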
    

Create pipeline

In this step, you create your first pipeline with an HDInsight Hive activity. The input slice is available monthly (frequency: Month, interval: 1), the output slice is produced monthly, and the scheduler property for the activity is also set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, the output dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset. The properties used in the following JSON are explained at the end of this section.

  1. Create a JSON file named MyFirstPipelinePSH.json in the C:\ADFGetStarted folder with the following content:

    Important

    Replace storageaccountname with the name of your storage account in the JSON.

    {
        "name": "MyFirstPipeline",
        "properties": {
            "description": "My first Azure Data Factory pipeline",
            "activities": [
                {
                    "type": "HDInsightHive",
                    "typeProperties": {
                        "scriptPath": "adfgetstarted/script/partitionweblogs.hql",
                        "scriptLinkedService": "StorageLinkedService",
                        "defines": {
                            "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
                            "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
                        }
                    },
                    "inputs": [
                        {
                            "name": "AzureBlobInput"
                        }
                    ],
                    "outputs": [
                        {
                            "name": "AzureBlobOutput"
                        }
                    ],
                    "policy": {
                        "concurrency": 1,
                        "retry": 3
                    },
                    "scheduler": {
                        "frequency": "Month",
                        "interval": 1
                    },
                    "name": "RunSampleHiveActivity",
                    "linkedServiceName": "HDInsightOnDemandLinkedService"
                }
            ],
            "start": "2017-07-01T00:00:00Z",
            "end": "2017-07-02T00:00:00Z",
            "isPaused": false
        }
    }
    

    In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process data on an HDInsight cluster.

    The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the scriptLinkedService, called StorageLinkedService), in the script folder of the adfgetstarted container.

    The defines section is used to specify the runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).

    The start and end properties of the pipeline specify the active period of the pipeline.

    In the activity JSON, you specify that the Hive script runs on the compute specified by the linkedServiceName property (HDInsightOnDemandLinkedService).

    Note

    See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about JSON properties that are used in the example.

  2. Confirm that you see the input.log file in the adfgetstarted/inputdata folder in the Azure blob storage (a PowerShell check is sketched after this procedure), and run the following command to deploy the pipeline. Because the start and end times are set in the past and isPaused is set to false, the pipeline (the activity in the pipeline) runs immediately after you deploy.

    New-AzureRmDataFactoryPipeline $df -File .\MyFirstPipelinePSH.json
    
  3. Congratulations, you have successfully created your first pipeline using Azure PowerShell!
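
The following is a minimal sketch for the input-file check mentioned in step 2, using the Azure.Storage cmdlets; the account name and key are placeholders. The command returns the blob's details if input.log exists and raises an error otherwise.

    # Placeholder account name and key; build a storage context for the account.
    $ctx = New-AzureStorageContext -StorageAccountName "<yourstorageaccount>" -StorageAccountKey "<yourstoragekey>"

    # Look for the input file that the Hive activity expects.
    Get-AzureStorageBlob -Container "adfgetstarted" -Blob "inputdata/input.log" -Context $ctx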

Monitor pipeline

In this step, you use Azure PowerShell to monitor what’s going on in an Azure data factory.

  1. Run Get-AzureRmDataFactory and assign the output to a $df variable.

    $df=Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH
    
  2. Run Get-AzureRmDataFactorySlice to get details about all slices of the AzureBlobOutput dataset, which is the output dataset of the pipeline.

    Get-AzureRmDataFactorySlice $df -DatasetName AzureBlobOutput -StartDateTime 2017-07-01
    

    Notice that the StartDateTime you specify here is the same start time specified in the pipeline JSON. Here is the sample output:

    ResourceGroupName : ADFTutorialResourceGroup
    DataFactoryName   : FirstDataFactoryPSH
    DatasetName       : AzureBlobOutput
    Start             : 7/1/2017 12:00:00 AM
    End               : 7/2/2017 12:00:00 AM
    RetryCount        : 0
    State             : InProgress
    SubState          :
    LatencyStatus     :
    LongRetryCount    : 0
    
  3. Run Get-AzureRmDataFactoryRun to get the details of activity runs for a specific slice.

    Get-AzureRmDataFactoryRun $df -DatasetName AzureBlobOutput -StartDateTime 2017-07-01
    

    Here is the sample output:

    Id                  : 0f6334f2-d56c-4d48-b427-d4f0fb4ef883_635268096000000000_635292288000000000_AzureBlobOutput
    ResourceGroupName   : ADFTutorialResourceGroup
    DataFactoryName     : FirstDataFactoryPSH
    DatasetName         : AzureBlobOutput
    ProcessingStartTime : 12/18/2015 4:50:33 AM
    ProcessingEndTime   : 12/31/9999 11:59:59 PM
    PercentComplete     : 0
    DataSliceStart      : 7/1/2017 12:00:00 AM
    DataSliceEnd        : 7/2/2017 12:00:00 AM
    Status              : AllocatingResources
    Timestamp           : 12/18/2015 4:50:33 AM
    RetryAttempt        : 0
    Properties          : {}
    ErrorMessage        :
    ActivityName        : RunSampleHiveActivity
    PipelineName        : MyFirstPipeline
    Type                : Script
    

    You can keep running this cmdlet until you see the slice in Ready state or Failed state. When the slice is in Ready state, check the partitioneddata folder in the adfgetstarted container in your blob storage for the output data. Creation of an on-demand HDInsight cluster usually takes some time.

    (Image: output data in the adfgetstarted/partitioneddata folder)
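
Rather than rerunning the cmdlet manually, you can poll until the slice reaches a terminal state. The following is a minimal sketch that checks the output slice every five minutes; it assumes the $df variable is still set as described earlier.

    # Poll the output slice every 5 minutes until it is Ready or Failed.
    do {
        Start-Sleep -Seconds 300
        $slice = Get-AzureRmDataFactorySlice $df -DatasetName AzureBlobOutput -StartDateTime 2017-07-01 |
            Select-Object -First 1
        Write-Host ("{0}  slice state: {1}" -f (Get-Date), $slice.State)
    } while ($slice.State -notin @("Ready", "Failed"))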

Important

Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the pipeline to take approximately 30 minutes to process the slice.

The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.
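
To re-upload the input file from PowerShell, the following is a minimal sketch using the Azure.Storage cmdlets; it assumes a local copy of input.log in the current folder, and the account name and key are placeholders.

    # Placeholder account name and key; build a storage context for the account.
    $ctx = New-AzureStorageContext -StorageAccountName "<yourstorageaccount>" -StorageAccountKey "<yourstoragekey>"

    # Upload input.log to the inputdata folder of the adfgetstarted container (-Force overwrites an existing blob).
    Set-AzureStorageBlobContent -File .\input.log -Container "adfgetstarted" -Blob "inputdata/input.log" -Context $ctx -Force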

Summary

In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop cluster. You used Azure PowerShell to do the following steps:

  1. Created an Azure data factory.
  2. Created two linked services:
    1. Azure Storage linked service to link your Azure blob storage that holds input/output files to the data factory.
    2. Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the data factory. Azure Data Factory creates an HDInsight Hadoop cluster just-in-time to process input data and produce output data.
  3. Created two datasets, which describe the input and output data for the HDInsight Hive activity in the pipeline.
  4. Created a pipeline with an HDInsight Hive activity.

Next steps

In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script on an on-demand Azure HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure SQL, see Tutorial: Copy data from an Azure Blob to Azure SQL.

See Also

Topic Description
Data Factory Cmdlet Reference See comprehensive documentation on Data Factory cmdlets
Pipelines This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct end-to-end data-driven workflows for your scenario or business.
Datasets This article helps you understand datasets in Azure Data Factory.
Scheduling and Execution This article explains the scheduling and execution aspects of Azure Data Factory application model.
Monitor and manage pipelines using Monitoring App This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App.