Introduction to file mount/unmount APIs in Azure Synapse Analytics

The Azure Synapse Studio team built two new mount/unmount APIs in the Microsoft Spark Utilities (mssparkutils) package. You can use these APIs to attach remote storage (Azure Blob Storage or Azure Data Lake Storage Gen2) to all working nodes (driver node and worker nodes). After the storage is in place, you can use the local file API to access data as if it's stored in the local file system. For more information, see Introduction to Microsoft Spark Utilities.

The article shows you how to use mount/unmount APIs in your workspace. You'll learn:

  • How to mount Data Lake Storage Gen2 or Blob Storage.
  • How to access files under the mount point via the local file system API.
  • How to access files under the mount point by using the mssparktuils fs API.
  • How to access files under the mount point by using the Spark read API.
  • How to unmount the mount point.

Warning

Azure file-share mounting is temporarily disabled. You can use Data Lake Storage Gen2 or Azure Blob Storage mounting instead, as described in the next section.

Azure Data Lake Storage Gen1 storage is not supported. You can migrate to Data Lake Storage Gen2 by following the Azure Data Lake Storage Gen1 to Gen2 migration guidance before using the mount APIs.

Mount storage

This section illustrates how to mount Data Lake Storage Gen2 step by step as an example. Mounting Blob Storage works similarly.

The example assumes that you have one Data Lake Storage Gen2 account named storegen2. The account has one container named mycontainer that you want to mount to /test in your Spark pool.

Screenshot of a Data Lake Storage Gen2 storage account.

To mount the container called mycontainer, mssparkutils first needs to check whether you have the permission to access the container. Currently, Azure Synapse Analytics supports three authentication methods for the trigger mount operation: linkedService, accountKey, and sastoken.

We recommend a trigger mount via linked service. This method avoids security leaks, because mssparkutils doesn't store any secret or authentication values itself. Instead, mssparkutils always fetches authentication values from the linked service to request blob data from remote storage.

Screenshot of linked services.

You can create a linked service for Data Lake Storage Gen2 or Blob Storage. Currently, Azure Synapse Analytics supports two authentication methods when you create a linked service:

  • Create a linked service by using an account key

    Screenshot of selections for creating a linked service by using an account key.

  • Create a linked service by using a managed identity

    Screenshot of selections for creating a linked service by using a managed identity.

Note

If you create a linked service by using a managed identity as the authentication method, make sure that the workspace MSI file has the Storage Blob Data Contributor role of the mounted container.

After you create linked service successfully, you can easily mount the container to your Spark pool by using the following Python code:

mssparkutils.fs.mount( 
    "abfss://mycontainer@<accountname>.dfs.core.windows.net", 
    "/test", 
    {"linkedService":"mygen2account"} 
) 

Note

You might need to import mssparkutils if it's not available:

From notebookutils import mssparkutils 

We don't recommend that you mount a root folder, no matter which authentication method you use.

Mount via shared access signature token or account key

In addition to mounting through a linked service, mssparkutils supports explicitly passing an account key or shared access signature (SAS) token as a parameter to mount the target.

For security reasons, we recommend that you store account keys or SAS tokens in Azure Key Vault (as the following example screenshot shows). You can then retrieve them by using the mssparkutil.credentials.getSecret API. For more information, see Manage storage account keys with Key Vault and the Azure CLI (legacy).

Screenshot that shows a secret stored in a key vault.

Here's the sample code:

from notebookutils import mssparkutils  

accountKey = mssparkutils.credentials.getSecret("MountKV","mySecret")  
mssparkutils.fs.mount(  
    "abfss://mycontainer@<accountname>.dfs.core.windows.net",  
    "/test",  
    {"accountKey":accountKey}
) 

Note

For security reasons, don't store credentials in code.

Access files under the mount point by using the mssparktuils fs API

The main purpose of the mount operation is to let customers access the data stored in a remote storage account by using a local file system API. You can also access the data by using the mssparkutils fs API with a mounted path as a parameter. The path format used here is a little different.

Assume that you mounted the Data Lake Storage Gen2 container mycontainer to /test by using the mount API. When you access the data by using a local file system API, the path format is like this:

/synfs/{jobId}/test/{filename}

When you want to access the data by using the mssparkutils fs API, the path format is like this:

synfs:/{jobId}/test/{filename}

You can see that synfs is used as the schema in this case, instead of a part of the mounted path.

The following three examples show how to access a file with a mount point path by using mssparkutils fs. In the examples, 49 is a Spark job ID that we got from calling mssparkutils.env.getJobId().

  • List directories:

    mssparkutils.fs.ls("synfs:/49/test") 
    
  • Read file content:

    mssparkutils.fs.head("synfs:/49/test/myFile.txt") 
    
  • Create a directory:

    mssparkutils.fs.mkdirs("synfs:/49/test/newdir") 
    

Access files under the mount point by using the Spark read API

You can provide a parameter to access the data through the Spark read API. The path format here is the same when you use the mssparkutils fs API:

synfs:/{jobId}/test/{filename}

Read a file from a mounted Data Lake Storage Gen2 storage account

The following example assumes that a Data Lake Storage Gen2 storage account was already mounted, and then you read the file by using a mount path:

%%pyspark 

df = spark.read.load("synfs:/49/test/myFile.csv", format='csv') 
df.show() 

Read a file from a mounted Blob Storage account

If you mounted a Blob Storage account and want to access it by using mssparkutils or the Spark API, you need to explicitly configure the SAS token via Spark configuration before you try to mount the container by using the mount API:

  1. To access a Blob Storage account by using mssparkutils or the Spark API after a trigger mount, update the Spark configuration as shown in the following code example. You can bypass this step if you want to access the Spark configuration only by using the local file API after mounting.

    blob_sas_token = mssparkutils.credentials.getConnectionStringOrCreds("myblobstorageaccount") 
    
    spark.conf.set('fs.azure.sas.mycontainer.<blobStorageAccountName>.blob.core.windows.net', blob_sas_token) 
    
  2. Create the linked service myblobstorageaccount, and mount the Blob Storage account by using the linked service:

    %%spark 
    mssparkutils.fs.mount( 
        "wasbs://mycontainer@<blobStorageAccountName>.blob.core.windows.net", 
        "/test", 
        Map("linkedService" -> "myblobstorageaccount") 
    ) 
    
  3. Mount the Blob Storage container, and then read the file by using a mount path through the local file API:

        # mount the Blob Storage container, and then read the file by using a mount path
        with open("/synfs/64/test/myFile.txt") as f:
        print(f.read())
    
  4. Read the data from the mounted Blob Storage container through the Spark read API:

    %%spark
    // mount blob storage container and then read file using mount path
    val df = spark.read.text("synfs:/49/test/myFile.txt")
    df.show()
    

Unmount the mount point

Use the following code to unmount your mount point (/test in this example):

mssparkutils.fs.unmount("/test") 

Known limitations

  • The mssparkutils fs help function hasn't added the description about the mount/unmount part yet.

  • The unmount mechanism is not automatic. When the application run finishes, to unmount the mount point to release the disk space, you need to explicitly call an unmount API in your code. Otherwise, the mount point will still exist in the node after the application run finishes.

  • Mounting a Data Lake Storage Gen1 storage account is not supported for now.

Next steps