Use Python to manage directories and files in Azure Data Lake Storage Gen2

This article shows you how to use Python to create and manage directories and files in storage accounts that have a hierarchical namespace.

To learn about how to get, set, and update the access control lists (ACL) of directories and files, see Use Python to manage ACLs in Azure Data Lake Storage Gen2.

Package (Python Package Index) | Samples | API reference | Gen1 to Gen2 mapping

Prerequisites

  • An Azure subscription. See Get Azure free trial.

  • A storage account that has hierarchical namespace enabled. Follow these instructions to create one.

Set up your project

Install the Azure Data Lake Storage client library for Python by using pip.

pip install azure-storage-file-datalake

Add these import statements to the top of your code file.

import os, uuid, sys
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient, ContentSettings
from azure.core import MatchConditions

Connect to the account

To use the snippets in this article, you'll need to create a DataLakeServiceClient instance that represents the storage account.

Connect by using an account key

This is the easiest way to connect to an account.

This example creates a DataLakeServiceClient instance by using an account key.

def initialize_storage_account(storage_account_name, storage_account_key):
    
    try:  
        global service_client

        service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
            "https", storage_account_name), credential=storage_account_key)
    
    except Exception as e:
        print(e)

  • Replace the storage_account_name placeholder value with the name of your storage account.

  • Replace the storage_account_key placeholder value with your storage account access key.

Connect by using Azure Active Directory (Azure AD)

You can use the Azure identity client library for Python to authenticate your application with Azure AD. Install it by running pip install azure-identity.

This example creates a DataLakeServiceClient instance by using a client ID, a client secret, and a tenant ID. To get these values, see Acquire a token from Azure AD for authorizing requests from a client application.

def initialize_storage_account_ad(storage_account_name, client_id, client_secret, tenant_id):
    
    try:  
        global service_client

        credential = ClientSecretCredential(tenant_id, client_id, client_secret)

        service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
            "https", storage_account_name), credential=credential)
    
    except Exception as e:
        print(e)

Note

For more examples, see the Azure identity client library for Python documentation.
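
As a quick illustration, here's a minimal sketch that authenticates with DefaultAzureCredential from the azure-identity package instead of an explicit client secret. The function name is illustrative and isn't part of this article's samples.

from azure.identity import DefaultAzureCredential

def initialize_storage_account_default(storage_account_name):
    try:
        global service_client

        # DefaultAzureCredential tries several authentication methods in turn
        # (environment variables, managed identity, Azure CLI, and so on).
        credential = DefaultAzureCredential()

        service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
            "https", storage_account_name), credential=credential)

    except Exception as e:
        print(e)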

Create a container

A container acts as a file system for your files. You can create one by calling the DataLakeServiceClient.create_file_system method.

This example creates a container named my-file-system.

def create_file_system():
    try:
        global file_system_client

        file_system_client = service_client.create_file_system(file_system="my-file-system")
    
    except Exception as e:
        print(e)

Create a directory

Create a directory reference by calling the FileSystemClient.create_directory method.

This example adds a directory named my-directory to a container.

def create_directory():
    try:
        file_system_client.create_directory("my-directory")
    
    except Exception as e:
        print(e)

Rename or move a directory

Rename or move a directory by calling the DataLakeDirectoryClient.rename_directory method. Pass the path of the desired directory as a parameter.

This example renames a directory named my-directory to my-directory-renamed.

def rename_directory():
    try:
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")
        directory_client = file_system_client.get_directory_client("my-directory")

        new_dir_name = "my-directory-renamed"
        directory_client.rename_directory(new_name=directory_client.file_system_name + '/' + new_dir_name)

    except Exception as e:
        print(e)

Delete a directory

Delete a directory by calling the DataLakeDirectoryClient.delete_directory method.

This example deletes a directory named my-directory.

def delete_directory():
    try:
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")
        directory_client = file_system_client.get_directory_client("my-directory")

        directory_client.delete_directory()
    except Exception as e:
        print(e)

Upload a file to a directory

First, create a file reference in the target directory by creating an instance of the DataLakeFileClient class. Upload a file by calling the DataLakeFileClient.append_data method. Make sure to complete the upload by calling the DataLakeFileClient.flush_data method.

This example uploads a text file to a directory named my-directory.

def upload_file_to_directory():
    try:

        file_system_client = service_client.get_file_system_client(file_system="my-file-system")

        directory_client = file_system_client.get_directory_client("my-directory")
        
        file_client = directory_client.create_file("uploaded-file.txt")
        local_file = open("C:\\file-to-upload.txt",'r')

        file_contents = local_file.read()

        file_client.append_data(data=file_contents, offset=0, length=len(file_contents))

        file_client.flush_data(len(file_contents))

    except Exception as e:
        print(e)

Tip

If your file size is large, your code will have to make multiple calls to the DataLakeFileClient.append_data method. Consider using the DataLakeFileClient.upload_data method instead. That way, you can upload the entire file in a single call.
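
For illustration, here's a minimal sketch of that chunked pattern. The helper name, chunk size, and file paths are assumptions, not part of this article's samples.

def upload_large_file_in_chunks(local_path, target_file_name, chunk_size=4 * 1024 * 1024):
    try:
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")
        directory_client = file_system_client.get_directory_client("my-directory")
        file_client = directory_client.create_file(target_file_name)

        offset = 0
        with open(local_path, 'rb') as local_file:
            while True:
                chunk = local_file.read(chunk_size)
                if not chunk:
                    break
                # Append each chunk at the current offset in the remote file.
                file_client.append_data(data=chunk, offset=offset, length=len(chunk))
                offset += len(chunk)

        # Commit all appended data in a single flush.
        file_client.flush_data(offset)

    except Exception as e:
        print(e)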

Upload a large file to a directory

Use the DataLakeFileClient.upload_data method to upload large files without having to make multiple calls to the DataLakeFileClient.append_data method.

def upload_file_to_directory_bulk():
    try:

        file_system_client = service_client.get_file_system_client(file_system="my-file-system")

        directory_client = file_system_client.get_directory_client("my-directory")
        
        file_client = directory_client.get_file_client("uploaded-file.txt")

        local_file = open("C:\\file-to-upload.txt",'r')

        file_contents = local_file.read()

        file_client.upload_data(file_contents, overwrite=True)

    except Exception as e:
        print(e)

Download from a directory

Open a local file for writing. Then, create a DataLakeFileClient instance that represents the file that you want to download. Call the DataLakeFileClient.download_file method to read bytes from the file, and then write those bytes to the local file.

def download_file_from_directory():
    try:
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")

        directory_client = file_system_client.get_directory_client("my-directory")
        
        local_file = open("C:\\file-to-download.txt",'wb')

        file_client = directory_client.get_file_client("uploaded-file.txt")

        download = file_client.download_file()

        downloaded_bytes = download.readall()

        local_file.write(downloaded_bytes)

        local_file.close()

    except Exception as e:
        print(e)

List directory contents

List directory contents by calling the FileSystemClient.get_paths method, and then enumerating through the results.

This example prints the path of each subdirectory and file that is located in a directory named my-directory.

def list_directory_contents():
    try:
        
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")

        paths = file_system_client.get_paths(path="my-directory")

        for path in paths:
            print(path.name + '\n')

    except Exception as e:
        print(e)
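
To try the snippets end to end, you can call the functions above in order from a main block, as in this minimal sketch. The account name and key are placeholders; replace them with your own values.

if __name__ == "__main__":
    # Placeholder values; replace with your storage account name and access key.
    initialize_storage_account("mystorageaccount", "<storage-account-key>")
    create_file_system()
    create_directory()
    upload_file_to_directory()
    list_directory_contents()
    download_file_from_directory()
    delete_directory()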

See also