Create resources for a cloud-based, serverless ETL solution using Python on Azure

This article shows you how to use Azure CLI to deploy and configure the Azure resources used for our cloud-based, serverless ETL.

Resources diagram

Important

To complete each part of this series, you must create all of these resources in advance. Create each of the resources in a single resource group for organization and ease of resource clean-up.

Prerequisites

Before you can begin the steps in this article, complete the tasks below:

1. Set up your dev environment

If you haven't already, follow all the instructions on Configure your local Python dev environment for Azure.

  • Step 1: Run az login to sign into Azure.

    az login
    
  • Step 2: When using the Azure CLI, you can turn on the param-persist option, which automatically stores parameters for continued use. To learn more, see Azure CLI persisted parameters. [optional]

    az config param-persist on
    

Important

Be sure to create and activate a local virtual environment for this project.

2. Create an Azure Resource Group

Create an Azure Resource Group to logically organize the Azure services used in this series.

Azure Resource Groups can also provide more insights through resource monitoring and cost management.

  • Step 1: Run az group create to create a resource group for this series.

    service_location='eastus'
    resource_group_name='rg-cloudetl-demo'
    
    # Create an Azure Resource Group to logically organize the Azure services used in this series
    az group create \
        --location $service_location \
        --name $resource_group_name
    

Note

You cannot host Linux and Windows apps in the same resource group. If you have an existing resource group named rg-cloudetl-demo that contains a Windows function app or web app, you must use a different resource group.

3. Configure Azure Blob Storage

Azure Blob Storage is a general-purpose, object storage solution. In this series, blob storage acts as a landing zone for 'source' system data, a common data engineering scenario.

Create an Azure Storage Account

An Azure Storage Account is a namespace in Azure to store data. The blob storage URL combines the storage account name and the base Azure Storage Blob endpoint address, so the storage account name must be unique.
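As a quick illustration of why the name must be unique, the endpoint URL is derived from the account name alone. A minimal Python sketch, using the `stcloudetldemodata` account name from later in this article and the default public-cloud `blob.core.windows.net` suffix (sovereign clouds use different suffixes):

```python
# Build a storage account's blob endpoint URL from its name.
# Assumes the default public-cloud endpoint suffix.
def blob_endpoint(account_name: str) -> str:
    return f"https://{account_name}.blob.core.windows.net"

print(blob_endpoint("stcloudetldemodata"))
# https://stcloudetldemodata.blob.core.windows.net
```

Because this URL must be resolvable across all of Azure, two storage accounts can never share a name.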

The below instructions create the Azure Storage Account programmatically. However, you can also create a storage account using the Azure portal.

  • Step 1: Run az storage account create to create a Storage Account with Kind StorageV2, and assign an Azure Identity.

    storage_acct_name='stcloudetldemodata'
    
    # Create a general-purpose storage account in your resource group and assign it an identity
    az storage account create \
        --name $storage_acct_name \
        --resource-group $resource_group_name \
        --location $service_location \
        --kind StorageV2 \
        --sku Standard_LRS \
        --assign-identity
    
  • Step 2: Run az role assignment create to assign the 'Storage Blob Data Contributor' role to your user.

    user_email='jejohn@microsoft.com'
    
    # Assign the 'Storage Blob Data Contributor' role to your user
    az role assignment create \
        --assignee $user_email \
        --role 'Storage Blob Data Contributor' \
        --resource-group  $resource_group_name
    

Important

Role assignment creation can take a minute to propagate in Azure. Wait a moment before running the next command in this article.

Create a Container in the Storage Account

Containers organize blob data, similar to directories in a file system. A container can store an unlimited number of blobs, and a storage account can have multiple containers.
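Container names also follow strict rules: 3-63 characters of lowercase letters, numbers, and hyphens, with no leading, trailing, or consecutive hyphens. A minimal Python sketch of that validation (the rules are Azure's; the helper name is ours):

```python
import re

# Validate an Azure blob container name: 3-63 characters, lowercase
# letters, digits, and hyphens; no leading, trailing, or consecutive
# hyphens.
def is_valid_container_name(name: str) -> bool:
    return (3 <= len(name) <= 63
            and re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name) is not None)

print(is_valid_container_name("demo-cloudetl-data"))  # True
print(is_valid_container_name("Demo-Data"))           # False (uppercase)
```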

The below instructions create the containers programmatically. However, you can also create a container using the Azure portal.

  • Step 1: Run az storage container create to create two new containers in your Storage Account: one for the source data and the other for archiving processed files.

    abs_container_name='demo-cloudetl-data'
    abs_archive_container_name='demo-cloudetl-archive'
    
    # Create a storage container in a storage account.
    az storage container create \
        --name $abs_container_name \
        --account-name $storage_acct_name \
        --auth-mode login
    
    az storage container create \
        --name $abs_archive_container_name \
        --account-name $storage_acct_name \
        --auth-mode login
    
  • Step 2: Run az storage account show to capture the storage account ID.

    storage_acct_id=$(az storage account show \
                        --name $storage_acct_name  \
                        --resource-group $resource_group_name \
                        --query 'id' \
                        --output tsv)
    
  • Step 3: Run az storage account keys list to capture one of the storage account access keys for the next section.

    # Capture storage account access key1
    storage_acct_key1=$(az storage account keys list \
                            --resource-group $resource_group_name \
                            --account-name $storage_acct_name \
                            --query '[0].value' \
                            --output tsv)
    

4. Configure Azure Data Lake Gen2

Azure Data Lake Storage Gen 2 (ADLS) is built upon the Azure Blob File System (ABFS) over TLS/SSL for encryption. An optimized driver for big data workloads was also added to ADLS Gen 2. This feature, along with the cost savings, available storage tiers, and high-availability & disaster recovery options of blob storage, makes ADLS Gen 2 an ideal storage solution for big data analytics.

Create Azure Data Lake Storage Account

A storage account is created the same way for ADLS Gen 2 as for Azure Blob Storage. The only difference is that the hierarchical namespace (HNS) property must be enabled. The hierarchical namespace is a fundamental part of Data Lake Storage Gen2. This functionality enables the organization of objects/files into a hierarchy of directories for efficient data access.
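Once HNS is enabled, files are typically addressed with the ABFS scheme rather than the blob endpoint. A small Python sketch of the `abfss://` URI format, using the file system, account, and directory names created below:

```python
# Compose an ABFS URI of the form:
#   abfss://<file system>@<account>.dfs.core.windows.net/<path>
def abfss_uri(filesystem: str, account: str, path: str = "") -> str:
    uri = f"abfss://{filesystem}@{account}.dfs.core.windows.net"
    return f"{uri}/{path}" if path else uri

print(abfss_uri("processed-data-demo", "dlscloudetldemo", "finance_data"))
# abfss://processed-data-demo@dlscloudetldemo.dfs.core.windows.net/finance_data
```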

  • Step 1: Run az storage account create to create an Azure Data Lake Gen 2 Storage Account with Kind StorageV2, HNS enabled, and assign an Azure Identity.

    adls_acct_name='dlscloudetldemo'
    fsys_name='processed-data-demo'
    dir_name='finance_data'
    
    # Create an ADLS Gen2 storage account
    az storage account create \
        --name $adls_acct_name \
        --resource-group $resource_group_name \
        --kind StorageV2 \
        --hns \
        --location $service_location \
        --assign-identity
    
  • Step 2: Run az storage account keys list to capture one of the ADLS storage account access keys for the next section.

    adls_acct_key1=$(az storage account keys list \
                        --resource-group $resource_group_name \
                        --account-name $adls_acct_name \
                        --query '[0].value' \
                        --output tsv)
    

Note

It is very easy to turn a data lake into a data swamp. So, it is important to govern the data that resides in your data lake.

Azure Purview is a unified data governance service that helps you manage and govern your on-premises, multi-cloud, and software-as-a-service (SaaS) data. Easily create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage.

Configure Data Lake Storage structure

When loading data into a data lake, you must consider security, efficient processing, and partitioning. Azure Data Lake Storage Gen 2 uses directories instead of the virtual folders in blob storage. Directories allow for more precise access control and directory-level filesystem operations.

  • Step 1: Run az storage fs create to create a file system in ADLS Gen 2. A file system contains files and folders, similar to how a container in Azure Blob Storage contains blobs.

    # Create a file system in ADLS Gen2
    az storage fs create \
        --name $fsys_name \
        --account-name $adls_acct_name \
        --auth-mode login
    
  • Step 2: Run az storage fs directory create to create the directory (folder) in the newly created file system to land our processed data.

    # Create a directory in ADLS Gen2 file system
    az storage fs directory create \
        --name $dir_name \
        --file-system $fsys_name \
        --account-name $adls_acct_name \
        --auth-mode login
    

5. Set up Azure Key Vault

In the past, it was common practice to keep sensitive information out of application code by placing it in a 'config.json' file. However, that information is still stored in plain text. Additionally, in Azure, the developer must manually copy the values from the local app settings file to the Azure app configuration settings.

A better approach is to use Azure Key Vault. Azure Key Vault is a centralized cloud solution for storing and managing sensitive information, such as passwords, certificates, and keys. It also provides better access monitoring and logging to see who accesses secrets, when, and how.

Configure Azure Key Vault and secrets

Create a new Azure Key Vault within your resource group.

  • Step 1: Run az keyvault create to create an Azure Key Vault.

    key_vault_name='kv-cloudetl-demo'
    
    # Provision new Azure Key Vault in our resource group
    az keyvault create  \
        --location $service_location \
        --name $key_vault_name \
        --resource-group $resource_group_name
    
  • Step 2: Set a 'secret' in Azure Key Vault to store the Blob Storage Account access key. Run az keyvault secret set to create and set a secret in Azure Key Vault.

    abs_secret_name='abs-access-key1'
    adls_secret_name='adls-access-key1'
    
    # Create Secret for Azure Blob Storage Account
    az keyvault secret set \
        --vault-name $key_vault_name \
        --name $abs_secret_name \
        --value $storage_acct_key1
    
    # Create Secret for Azure Data Lake Storage Account
    az keyvault secret set \
        --vault-name $key_vault_name \
        --name $adls_secret_name \
        --value $adls_acct_key1
    

Important

If your secret value contains special characters, you need to 'escape' each special character by wrapping it in double quotes and wrapping the entire string in single quotes. Otherwise, the secret value is not set correctly.

  • Will not work: "This is my secret value & it has a special character."
  • Will not work: "This is my secret value '&' it has a special character."
  • Will work: 'this is my secret value "&" it has a special character'

Set environment variables

This application uses the key vault name as an environment variable called KEY_VAULT_NAME.

export KEY_VAULT_NAME=$key_vault_name
export ABS_SECRET_NAME=$abs_secret_name
export ADLS_SECRET_NAME=$adls_secret_name
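Later in the series, the function code reads these variables back to locate the vault. A minimal Python sketch (the `.vault.azure.net` suffix is the public-cloud Key Vault endpoint; `vault_uri_from_env` is an illustrative helper name, not part of the series code):

```python
import os

# Read the Key Vault name exported above and build the vault URI
# that the Azure SDK secret clients expect.
def vault_uri_from_env() -> str:
    key_vault_name = os.environ["KEY_VAULT_NAME"]
    return f"https://{key_vault_name}.vault.azure.net"
```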

6. Create a serverless function

A serverless architecture builds and runs services without infrastructure management, such as provisioning, scaling, and maintaining the resources required to run the Function App. Azure takes care of these management tasks in the backend, allowing developers to focus on building the app.

Create a local Python Function project

A local Python Function project is needed to build and execute our function during development. Create a function project using the Azure Functions Core Tools by following the steps below.

  • Step 1: Run the func init command to create a functions project in a folder named CloudETLDemo_Local:

    func init CloudETLDemo_Local --python
    
  • Step 2: Navigate into the project folder:

    cd CloudETLDemo_Local
    
  • Step 3: Add functions to your project by using the following command, where the --name argument is the unique name of your function and the --template argument specifies the function's trigger (HTTP).

    func new --name demo_relational_data_cloudetl --template "HTTP trigger" --authlevel "anonymous"
    
  • Step 4: Check that the function was correctly created by running the function locally. Start the local Azure Functions runtime host from the CloudETLDemo_Local folder:

    func start
    
  • Step 5: Grab the localhost URL at the bottom and append '?name=Functions' to the query string.

    http://localhost:7071/api/demo_relational_data_cloudetl?name=Functions
    
  • Step 6: When finished, use 'Ctrl+C' and choose y to stop the functions host.
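Appending the query string by hand works fine; for completeness, the same test URL can also be built programmatically. A small Python sketch using only the standard library:

```python
from urllib.parse import urlencode

# Build the local test URL from Step 5 with an explicit query string.
base_url = "http://localhost:7071/api/demo_relational_data_cloudetl"
url = f"{base_url}?{urlencode({'name': 'Functions'})}"
print(url)
# http://localhost:7071/api/demo_relational_data_cloudetl?name=Functions
```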

Initialize a Python Function App in Azure

An Azure Function App must be created to host our data ingestion function. This Function App is the deployment target for our local function once development is complete.

  • Step 1: Run az functionapp create to create the function app in Azure.

    funcapp_name='CloudETLFunc'
    
    # Create a serverless function app in the resource group.
    az functionapp create \
        --name $funcapp_name \
        --storage-account $storage_acct_name \
        --consumption-plan-location $service_location \
        --resource-group $resource_group_name \
        --os-type Linux \
        --runtime python \
        --runtime-version 3.7 \
        --functions-version 2
    

    Note

    App Name is also the default DNS domain for the function app.

  • Step 2: Run az functionapp config appsettings set to store Azure Key Vault name and Azure Blob Storage access key application configurations.

    # Update the function app's settings to include the Azure Key Vault name environment variable.
    az functionapp config appsettings set --name $funcapp_name --resource-group $resource_group_name --settings "KEY_VAULT_NAME=$key_vault_name"
    
    # Update the function app's settings to include the Azure Blob Storage secret name environment variable.
    az functionapp config appsettings set --name $funcapp_name --resource-group $resource_group_name --settings "ABS_SECRET_NAME=$abs_secret_name"
    
    # Update the function app's settings to include the ADLS Gen 2 secret name environment variable.
    az functionapp config appsettings set --name $funcapp_name --resource-group $resource_group_name --settings "ADLS_SECRET_NAME=$adls_secret_name"
    

7. Assign access policies and roles

A Key Vault access policy determines whether a given security principal (a user, application, or user group) can perform different operations on secrets, keys, and certificates.

  • Step 1: Create an access policy in Azure Key Vault for the Azure Function App.

    The below instructions assign access policies programmatically. However, you can also assign a Key Vault access policy using the Azure portal.

    # Generate managed service identity for function app
    az functionapp identity assign \
        --resource-group $resource_group_name \
        --name $funcapp_name
    
    # Capture the function app's managed identity id
    func_principal_id=$(az resource list \
                --name $funcapp_name \
                --query '[*].identity.principalId' \
                --output tsv)
    
    # Capture the key vault object/resource id
    kv_scope=$(az resource list \
                    --name $key_vault_name \
                    --query '[*].id' \
                    --output tsv)
    
    # Set a permissions policy (get, list, and set) for the function app on the key vault
    az keyvault set-policy \
        --name $key_vault_name \
        --resource-group $resource_group_name \
        --object-id $func_principal_id \
        --secret-permissions get list set
    
  • Step 2: Run az role assignment create to assign the 'Key Vault Contributor', 'Storage Blob Data Contributor', and 'Storage Queue Data Contributor' built-in roles to the function app's managed identity.

    # Create a 'Key Vault Contributor' role assignment for function app managed identity
    az role assignment create \
        --assignee $func_principal_id \
        --role 'Key Vault Contributor' \
        --scope $kv_scope
    
    # Assign the 'Storage Blob Data Contributor' role to the function app managed identity
    az role assignment create \
        --assignee $func_principal_id \
        --role 'Storage Blob Data Contributor' \
        --resource-group  $resource_group_name
    
    # Assign the 'Storage Queue Data Contributor' role to the function app managed identity
    az role assignment create \
        --assignee $func_principal_id \
        --role 'Storage Queue Data Contributor' \
        --resource-group  $resource_group_name
    

8. Upload a CSV Blob to the Container

To ingest relational data later in this series, upload a data file (blob) to an Azure Storage container.

Note

If you already have your data (blob) uploaded, you can skip to the next article in this series.

Sample Data

Segment     Country  Product    Units Sold  Manufacturing Price  Sale Price  Gross Sales  Date
Government  Canada   Carretera  1618.5      $3.00                $20.00      $32,370.00   1/1/2014
Government  Germany  Carretera  1321        $3.00                $20.00      $26,420.00   1/1/2014
Midmarket   France   Carretera  2178        $3.00                $15.00      $32,670.00   6/1/2014
Midmarket   Germany  Carretera  888         $3.00                $15.00      $13,320.00   6/1/2014
Midmarket   Mexico   Carretera  2470        $3.00                $15.00      $37,050.00   6/1/2014
  • Step 1: Create a file named 'financial_sample.csv' locally that contains this data by copying the below data into the file:

    Segment,Country,Product,Units Sold,Manufacturing Price,Sale Price,Gross Sales,Date
    Government,Canada,Carretera,1618.5,$3.00,$20.00,"$32,370.00",1/1/2014
    Government,Germany,Carretera,1321,$3.00,$20.00,"$26,420.00",1/1/2014
    Midmarket,France,Carretera,2178,$3.00,$15.00,"$32,670.00",6/1/2014
    Midmarket,Germany,Carretera,888,$3.00,$15.00,"$13,320.00",6/1/2014
    Midmarket,Mexico,Carretera,2470,$3.00,$15.00,"$37,050.00",6/1/2014
    
  • Step 2: Upload your data (blob) to your storage container by running az storage blob upload.

    az storage blob upload \
        --account-name $storage_acct_name \
        --container-name $abs_container_name \
        --name 'financial_sample.csv' \
        --file 'financial_sample.csv' \
        --auth-mode login
    
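To preview the ingestion work later in this series, the uploaded file can be parsed with Python's standard csv module. A minimal sketch: note the currency columns must be stripped of '$' and ',' before numeric use (`parse_money` is an illustrative helper, not part of the series code):

```python
import csv
import io

# A few rows of the sample data uploaded above.
sample = """Segment,Country,Product,Units Sold,Manufacturing Price,Sale Price,Gross Sales,Date
Government,Canada,Carretera,1618.5,$3.00,$20.00,"$32,370.00",1/1/2014
Midmarket,France,Carretera,2178,$3.00,$15.00,"$32,670.00",6/1/2014
"""

def parse_money(value: str) -> float:
    # Strip the currency symbol and thousands separators.
    return float(value.replace("$", "").replace(",", ""))

rows = list(csv.DictReader(io.StringIO(sample)))
gross = [parse_money(row["Gross Sales"]) for row in rows]
print(gross)  # [32370.0, 32670.0]
```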

Next Step