Create resources for a cloud-based, serverless ETL solution using Python on Azure
This article shows you how to use Azure CLI to deploy and configure the Azure resources used for our cloud-based, serverless ETL.
Important
To complete each part of this series, you must create all of these resources in advance. Create each of the resources in a single resource group for organization and ease of resource clean-up.
Prerequisites
Before you begin the steps in this article, complete the following tasks:

- An Azure subscription; if you don't have one, create one for free.
- Python 3.7 or later installed (check with `python --version`).
- Azure CLI (check with `az --version`); the CLI commands can be run in the Azure Cloud Shell, or you can install Azure CLI locally.
- Visual Studio Code installed on one of the supported platforms (check with `code --version`).
- The latest version of Azure Functions Core Tools installed (check with `func --version`).
- Visual Studio Code extensions installed.
1. Set up your dev environment
If you haven't already, follow all the instructions on Configure your local Python dev environment for Azure.
Step 1: Run az login to sign in to Azure.

```bash
az login
```

Step 2: When using the Azure CLI, you can turn on the param-persist option, which automatically stores parameters for continued use. To learn more, see Azure CLI persisted parameters. (Optional)

```bash
az config param-persist on
```
Important
Be sure to create and activate a local virtual environment for this project.
2. Create an Azure Resource Group
Create an Azure Resource Group to logically organize the Azure services used in this series.
Azure Resource Groups can also provide more insights through resource monitoring and cost management.
Step 1: Run az group create to create a resource group for this series.
```bash
service_location='eastus'
resource_group_name='rg-cloudetl-demo'

# Create an Azure resource group to organize the Azure services used in this series
az group create \
    --location $service_location \
    --name $resource_group_name
```
Note
You can't host Linux and Windows apps in the same resource group. If you have an existing resource group named rg-cloudetl-demo that contains a Windows function app or web app, you must use a different resource group.
3. Configure Azure Blob Storage
Azure Blob Storage is a general-purpose, object storage solution. In this series, blob storage acts as a landing zone for 'source' system data and is a common data engineering scenario.
Create an Azure Storage Account
An Azure Storage Account is a namespace in Azure to store data. The blob storage URL combines the storage account name and the base Azure Storage Blob endpoint address, so the storage account name must be unique.
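The naming constraint and URL construction can be sketched in Python. This is an illustrative helper, not part of the series' code; the 3-24 character, lowercase-letters-and-digits rule is the standard Azure storage account naming requirement:

```python
import re


def blob_endpoint(storage_acct_name: str) -> str:
    """Build the blob service URL from a storage account name."""
    # Storage account names must be 3-24 characters, lowercase letters and digits only
    if not re.fullmatch(r"[a-z0-9]{3,24}", storage_acct_name):
        raise ValueError(f"invalid storage account name: {storage_acct_name!r}")
    # The account name is combined with the base Azure Storage Blob endpoint address
    return f"https://{storage_acct_name}.blob.core.windows.net"


print(blob_endpoint("stcloudetldemodata"))
# https://stcloudetldemodata.blob.core.windows.net
```

Because the account name becomes part of a public DNS name, it must be globally unique across Azure.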
The following instructions create the Azure Storage Account programmatically. However, you can also create a storage account by using the Azure portal.
Step 1: Run az storage account create to create a Storage Account with Kind StorageV2, and assign an Azure Identity.
```bash
storage_acct_name='stcloudetldemodata'

# Create a general-purpose storage account in your resource group and assign it an identity
az storage account create \
    --name $storage_acct_name \
    --resource-group $resource_group_name \
    --location $service_location \
    --sku Standard_LRS \
    --assign-identity
```

Step 2: Run az role assignment create to add the 'Storage Blob Data Contributor' role to your user email.
```bash
user_email='jejohn@microsoft.com'

# Assign the 'Storage Blob Data Contributor' role to your user
az role assignment create \
    --assignee $user_email \
    --role 'Storage Blob Data Contributor' \
    --resource-group $resource_group_name
```
Important
Role assignment creation can take a minute to apply in Azure. Wait a moment before running the next command in this article.
Create a Container in the Storage Account
Containers organize blob data, similar to a file system directory. A container can store an unlimited number of blobs, and a storage account can have multiple containers.
The following instructions create the containers programmatically. However, you can also create a container by using the Azure portal.
Step 1: Run az storage container create to create two new containers in your Storage Account, one for the source data and the other for archiving processed files.

```bash
abs_container_name='demo-cloudetl-data'
abs_archive_container_name='demo-cloudetl-archive'

# Create a storage container in a storage account
az storage container create \
    --name $abs_container_name \
    --account-name $storage_acct_name \
    --auth-mode login

az storage container create \
    --name $abs_archive_container_name \
    --account-name $storage_acct_name \
    --auth-mode login
```

Step 2: Run az storage account show to capture the storage account ID.

```bash
storage_acct_id=$(az storage account show \
    --name $storage_acct_name \
    --resource-group $resource_group_name \
    --query 'id' \
    --output tsv)
```

Step 3: Run az storage account keys list to capture one of the storage account access keys for the next section.

```bash
# Capture storage account access key1
storage_acct_key1=$(az storage account keys list \
    --resource-group $resource_group_name \
    --account-name $storage_acct_name \
    --query [0].value \
    --output tsv)
```
4. Configure Azure Data Lake Gen2
Azure Data Lake Storage Gen 2 (ADLS) is built upon the Azure Blob File System (ABFS) over TLS/SSL for encryption. ADLS Gen 2 also adds a driver optimized for big data workloads. This feature, along with the cost savings, available storage tiers, and high-availability and disaster recovery options of blob storage, makes ADLS Gen 2 the ideal storage solution for big data analytics.
Create Azure Data Lake Storage Account
A storage account for ADLS Gen 2 is created the same way as for Azure Blob Storage. The only difference is that the hierarchical namespace (HNS) property must be enabled. The hierarchical namespace is a fundamental part of Data Lake Storage Gen2. This functionality enables the organization of objects/files into a hierarchy of directories for efficient data access.
Step 1: Run az storage account create to create an Azure Data Lake Gen 2 Storage Account with Kind StorageV2, HNS enabled, and assign an Azure Identity.
```bash
adls_acct_name='dlscloudetldemo'
fsys_name='processed-data-demo'
dir_name='finance_data'

# Create an ADLS Gen2 account
az storage account create \
    --name $adls_acct_name \
    --resource-group $resource_group_name \
    --kind StorageV2 \
    --hns \
    --location $service_location \
    --assign-identity
```

Step 2: Run az storage account keys list to capture one of the ADLS storage account access keys for the next section.

```bash
adls_acct_key1=$(az storage account keys list \
    --resource-group $resource_group_name \
    --account-name $adls_acct_name \
    --query [0].value \
    --output tsv)
```
Note
It is very easy to turn a data lake into a data swamp. So, it is important to govern the data that resides in your data lake.
Azure Purview is a unified data governance service that helps you manage and govern your on-premises, multi-cloud, and software-as-a-service (SaaS) data. Easily create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage.
Configure Data Lake Storage structure
When loading data into a data lake, plan the layout to support security, efficient processing, and partitioning. Azure Data Lake Storage Gen 2 uses directories instead of the virtual folders in blob storage. Directories allow for more precise security and access control, and support directory-level file system operations.
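Partitioning usually shows up as a path convention inside the file system. As a minimal sketch (the date-partitioned layout below is hypothetical, for illustration; this series lands files directly in the 'finance_data' directory):

```python
from datetime import date
from pathlib import PurePosixPath


def processed_data_path(directory: str, run_date: date, filename: str) -> str:
    """Build a hierarchical ADLS Gen2 path that partitions processed data by date."""
    # Year/month directories keep related files together for efficient scans
    return str(PurePosixPath(directory) / f"{run_date:%Y}" / f"{run_date:%m}" / filename)


print(processed_data_path("finance_data", date(2014, 1, 1), "financial_sample.parquet"))
# finance_data/2014/01/financial_sample.parquet
```

Because directories are real objects in ADLS Gen 2 (unlike blob virtual folders), access control lists can be applied at each level of such a hierarchy.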
Step 1: Run az storage fs create to create a file system in ADLS Gen 2. A file system contains files and folders, similarly to how a container in Azure Blob Storage contains blobs.
```bash
# Create a file system in ADLS Gen2
az storage fs create \
    --name $fsys_name \
    --account-name $adls_acct_name \
    --auth-mode login
```

Step 2: Run az storage fs directory create to create the directory (folder) in the newly created file system to land our processed data.

```bash
# Create a directory in ADLS Gen2 file system
az storage fs directory create \
    --name $dir_name \
    --file-system $fsys_name \
    --account-name $adls_acct_name \
    --auth-mode login
```
5. Set up Azure Key Vault
In the past, it was common practice to keep sensitive information out of application code by storing it in a 'config.json' file. However, the sensitive information was still stored in plain text. Additionally, in Azure, the developer had to manually copy values from the local app settings file to the Azure app configuration settings.
A better approach is to use Azure Key Vault. Azure Key Vault is a centralized cloud solution for storing and managing sensitive information, such as passwords, certificates, and keys. Using Azure Key Vault also provides better access monitoring and logs, showing who accesses a secret, when, and how.
Configure Azure Key Vault and secrets
Create a new Azure Key Vault within your resource group.
Step 1: Run az keyvault create to create an Azure Key Vault.
```bash
key_vault_name='kv-cloudetl-demo'

# Provision a new Azure Key Vault in our resource group
az keyvault create \
    --location $service_location \
    --name $key_vault_name \
    --resource-group $resource_group_name
```

Step 2: Set a 'secret' in Azure Key Vault to store the Blob Storage Account access key. Run az keyvault secret set to create and set a secret in Azure Key Vault.

```bash
abs_secret_name='abs-access-key1'
adls_secret_name='adls-access-key1'

# Create a secret for the Azure Blob Storage account
az keyvault secret set \
    --vault-name $key_vault_name \
    --name $abs_secret_name \
    --value $storage_acct_key1

# Create a secret for the Azure Data Lake Storage account
az keyvault secret set \
    --vault-name $key_vault_name \
    --name $adls_secret_name \
    --value $adls_acct_key1
```
Important
If your secret value contains special characters, you need to 'escape' the special character by wrapping it in double quotes and wrapping the entire string in single quotes. Otherwise, the secret value isn't set correctly.
- Will not work: "This is my secret value & it has a special character."
- Will not work: "This is my secret value '&' it has a special character."
- Will work: 'this is my secret value "&" it has a special character'
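If you script secret creation from Python rather than typing the command into bash, a library can do the shell quoting for you. A minimal sketch using the standard library's shlex.quote (the command string below is illustrative, not executed):

```python
import shlex

secret_value = 'this is my secret value & it has a special character'

# shlex.quote wraps the string in single quotes whenever it contains
# characters (like '&') that the shell would otherwise interpret
safe = shlex.quote(secret_value)
print(safe)

# Illustrative only: the quoted value can be embedded safely in a CLI call
command = f"az keyvault secret set --vault-name kv-cloudetl-demo --name abs-access-key1 --value {safe}"
print(command)
```

This avoids hand-crafting the single-quote/double-quote nesting described in the note above.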
Set environment variables
This application uses the key vault name as an environment variable called KEY_VAULT_NAME.
```bash
export KEY_VAULT_NAME=$key_vault_name
export ABS_SECRET_NAME=$abs_secret_name
export ADLS_SECRET_NAME=$adls_secret_name
```
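The application code later in the series reads these names from the environment. A minimal sketch of the lookup (the setdefault calls mirror the exports above so the example is self-contained; the Key Vault URI format is the standard one for public Azure):

```python
import os

# Mirror the exports above so this sketch runs standalone
os.environ.setdefault("KEY_VAULT_NAME", "kv-cloudetl-demo")
os.environ.setdefault("ABS_SECRET_NAME", "abs-access-key1")
os.environ.setdefault("ADLS_SECRET_NAME", "adls-access-key1")

key_vault_name = os.environ["KEY_VAULT_NAME"]
abs_secret_name = os.environ["ABS_SECRET_NAME"]
adls_secret_name = os.environ["ADLS_SECRET_NAME"]

# The vault URI is derived from the vault name
key_vault_uri = f"https://{key_vault_name}.vault.azure.net"
print(key_vault_uri)
```

Keeping only the vault and secret *names* in configuration, and never the secret values themselves, is the point of the Key Vault approach.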
6. Create a serverless function
A serverless architecture builds and runs services without infrastructure management, such as provisioning, scaling, and maintaining the resources required to run the Function App. Azure takes care of these management tasks in the backend, allowing developers to focus on building the app.
Create a local Python Function project
A local Python Function project is needed to build and execute our function during development. Create a function project using the Azure Functions Core Tools and following the steps below.
Step 1: Run the func init command to create a functions project in a folder named CloudETLDemo_Local:

```bash
func init CloudETLDemo_Local --python
```

Step 2: Navigate into the project folder:

```bash
cd CloudETLDemo_Local
```

Step 3: Add a function to your project by using the following command, where the --name argument is the unique name of your function and the --template argument specifies the function's trigger (HTTP):

```bash
func new --name demo_relational_data_cloudetl --template "HTTP trigger" --authlevel "anonymous"
```

Step 4: Check that the function was created correctly by running it locally. Start the local Azure Functions runtime host from the CloudETLDemo_Local folder:

```bash
func start
```

Step 5: Grab the localhost URL at the bottom of the output and append '?name=Functions' to the query string:

```
http://localhost:7071/api/demo_relational_data_cloudetl?name=Functions
```

Step 6: When finished, press 'Ctrl+C' and choose 'y' to stop the functions host.
Initialize a Python Function App in Azure
An Azure Function App must be created to host our data ingestion function. Once development is complete, we deploy our local function to this Function App.
Step 1: Run az functionapp create to create the function app in Azure.
```bash
funcapp_name='CloudETLFunc'

# Create a serverless function app in the resource group
az functionapp create \
    --name $funcapp_name \
    --storage-account $storage_acct_name \
    --consumption-plan-location $service_location \
    --resource-group $resource_group_name \
    --os-type Linux \
    --runtime python \
    --runtime-version 3.7 \
    --functions-version 2
```

Note

The app name is also the default DNS domain for the function app.
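Because the app name doubles as the DNS prefix, the deployed function's base URL can be derived from it. A sketch of that mapping (DNS names are case-insensitive, so the app name is lowercased in the hostname):

```python
def function_app_url(app_name: str, function_name: str) -> str:
    """Derive the default HTTPS endpoint of a function deployed to a function app.

    The app name becomes the DNS prefix on azurewebsites.net.
    """
    return f"https://{app_name.lower()}.azurewebsites.net/api/{function_name}"


print(function_app_url("CloudETLFunc", "demo_relational_data_cloudetl"))
# https://cloudetlfunc.azurewebsites.net/api/demo_relational_data_cloudetl
```

This is also why the function app name, like the storage account name, must be globally unique.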
Step 2: Run az functionapp config appsettings set to store Azure Key Vault name and Azure Blob Storage access key application configurations.
```bash
# Update function app settings to include the Azure Key Vault environment variable
az functionapp config appsettings set \
    --name $funcapp_name \
    --resource-group $resource_group_name \
    --settings "KEY_VAULT_NAME=$key_vault_name"

# Update function app settings to include the Azure Blob Storage access key secret name
az functionapp config appsettings set \
    --name $funcapp_name \
    --resource-group $resource_group_name \
    --settings "ABS_SECRET_NAME=$abs_secret_name"

# Update function app settings to include the Azure Data Lake Storage Gen 2 access key secret name
az functionapp config appsettings set \
    --name $funcapp_name \
    --resource-group $resource_group_name \
    --settings "ADLS_SECRET_NAME=$adls_secret_name"
```
7. Assign access policies and roles
A Key Vault access policy determines whether a security principal (a user, application, or user group) can perform different operations on secrets, keys, and certificates.
Step 1: Create an access policy in Azure Key Vault for the Azure Function App.
The following instructions assign access policies programmatically. However, you can also assign a Key Vault access policy by using the Azure portal.
```bash
# Generate a managed service identity for the function app
az functionapp identity assign \
    --resource-group $resource_group_name \
    --name $funcapp_name

# Capture the function app managed identity ID
func_principal_id=$(az resource list \
    --name $funcapp_name \
    --query [*].identity.principalId \
    --output tsv)

# Capture the key vault object/resource ID
kv_scope=$(az resource list \
    --name $key_vault_name \
    --query [*].id \
    --output tsv)

# Set a permissions policy for the function app on the key vault: get, list, and set
az keyvault set-policy \
    --name $key_vault_name \
    --resource-group $resource_group_name \
    --object-id $func_principal_id \
    --secret-permission get list set
```

Step 2: Run az role assignment create to assign the 'Key Vault Contributor' built-in role to the Azure Function App.

```bash
# Create a 'Key Vault Contributor' role assignment for the function app managed identity
az role assignment create \
    --assignee $func_principal_id \
    --role 'Key Vault Contributor' \
    --scope $kv_scope

# Assign the 'Storage Blob Data Contributor' role to the function app managed identity
az role assignment create \
    --assignee $func_principal_id \
    --role 'Storage Blob Data Contributor' \
    --resource-group $resource_group_name

# Assign the 'Storage Queue Data Contributor' role to the function app managed identity
az role assignment create \
    --assignee $func_principal_id \
    --role 'Storage Queue Data Contributor' \
    --resource-group $resource_group_name
```
8. Upload a CSV Blob to the Container
To ingest relational data later in this series, upload a data file (blob) to an Azure Storage container.
Note
If you already have your data (blob) uploaded, you can skip to the next article in this series.
Sample Data
| Segment | Country | Product | Units Sold | Manufacturing Price | Sale Price | Gross Sales | Date |
|---|---|---|---|---|---|---|---|
| Government | Canada | Carretera | 1618.5 | $3.00 | $20.00 | $32,370.00 | 1/1/2014 |
| Government | Germany | Carretera | 1321 | $3.00 | $20.00 | $26,420.00 | 1/1/2014 |
| Midmarket | France | Carretera | 2178 | $3.00 | $15.00 | $32,670.00 | 6/1/2014 |
| Midmarket | Germany | Carretera | 888 | $3.00 | $15.00 | $13,320.00 | 6/1/2014 |
| Midmarket | Mexico | Carretera | 2470 | $3.00 | $15.00 | $37,050.00 | 6/1/2014 |
Step 1: Create a file named 'financial_sample.csv' locally that contains this data by copying the below data into the file:
```
Segment,Country,Product,Units Sold,Manufacturing Price,Sale Price,Gross Sales,Date
Government,Canada,Carretera,1618.5,$3.00,$20.00,"$32,370.00",1/1/2014
Government,Germany,Carretera,1321,$3.00,$20.00,"$26,420.00",1/1/2014
Midmarket,France,Carretera,2178,$3.00,$15.00,"$32,670.00",6/1/2014
Midmarket,Germany,Carretera,888,$3.00,$15.00,"$13,320.00",6/1/2014
Midmarket,Mexico,Carretera,2470,$3.00,$15.00,"$37,050.00",6/1/2014
```

Step 2: Upload your data (blob) to your storage container by running az storage blob upload.

```bash
az storage blob upload \
    --account-name $storage_acct_name \
    --container-name $abs_container_name \
    --name 'financial_sample.csv' \
    --file 'financial_sample.csv' \
    --auth-mode login
```
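Before wiring up the ingestion function, it can help to confirm the sample file parses as expected. A quick local check (the first two rows are inlined here so the sketch is self-contained; the same logic would apply to the full 'financial_sample.csv'):

```python
import csv
import io

sample = """Segment,Country,Product,Units Sold,Manufacturing Price,Sale Price,Gross Sales,Date
Government,Canada,Carretera,1618.5,$3.00,$20.00,"$32,370.00",1/1/2014
Government,Germany,Carretera,1321,$3.00,$20.00,"$26,420.00",1/1/2014
"""


def parse_currency(value: str) -> float:
    """Convert a '$32,370.00'-style string to a float."""
    return float(value.replace("$", "").replace(",", ""))


rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    gross = parse_currency(row["Gross Sales"])
    expected = float(row["Units Sold"]) * parse_currency(row["Sale Price"])
    # Gross Sales should equal Units Sold * Sale Price
    assert abs(gross - expected) < 0.01, row

print(f"{len(rows)} rows validated")
```

Note that the quoted "Gross Sales" field contains an embedded comma, which is why csv.DictReader (rather than a naive split on commas) is used.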