Move data to Azure Blob storage
If your workflow includes moving data to Azure Blob storage, make sure you are using an efficient strategy. You can either pre-load data in a new blob container before defining it as a storage target, or add the container and then copy your data using Azure HPC Cache.
This article explains the best ways to move data to blob storage for use with Azure HPC Cache.
This article does not apply to NFS-mounted blob storage (ADLS-NFS storage targets). You can use any NFS-based method to populate an ADLS-NFS blob container before adding it to the HPC Cache. Read Pre-load data with NFS protocol to learn more.
Keep these facts in mind:
- Azure HPC Cache uses a specialized storage format to organize data in blob storage. This is why a blob storage target must either be a new, empty container, or a blob container that was previously used for Azure HPC Cache data.
- Copying data through the Azure HPC Cache to a back-end storage target is more efficient when you use multiple clients and parallel operations. A simple copy command from one client will move data slowly.
A Python-based utility is available to load content into a blob storage container. Read Pre-load data in blob storage to learn more.
If you don't want to use the loading utility, or if you want to add content to an existing storage target, follow the parallel data ingest tips in Copy data through the Azure HPC Cache.
Pre-load data in blob storage with CLFSLoad
You can use the Avere CLFSLoad utility to copy data to a new blob storage container before you add it as a storage target. This utility runs on a single Linux system and writes data in the proprietary format needed for Azure HPC Cache. CLFSLoad is the most efficient way to populate a blob storage container for use with the cache.
The Avere CLFSLoad utility is available by request from your Azure HPC Cache team. Ask your team contact for it, or open a support ticket to request assistance.
This option works with new, empty containers only. Create the container before using Avere CLFSLoad.
Detailed information is included in the Avere CLFSLoad distribution, which is available on request from the Azure HPC Cache team.
A general overview of the process:
- Prepare a Linux system (VM or physical) with Python version 3.6 or later. Python 3.7 is recommended for better performance.
- Install the Avere-CLFSLoad software on the Linux system.
- Execute the transfer from the Linux command line.
The Avere CLFSLoad utility needs the following information:
- The storage account ID that contains your blob storage container
- The name of the empty blob storage container
- A shared access signature (SAS) token that allows the utility to write to the container
- A local path to the data source - either a local directory that contains the data to copy, or a local path to a mounted remote system with the data
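Before you start a transfer, it can help to check these four inputs. The following Python sketch is not part of Avere CLFSLoad and does not call it; the LoadInputs class and the find_problems function are hypothetical names used only for illustration. It checks what can be verified locally: that each value is present, that the container name follows the Azure blob container naming rules (3 to 63 characters; lowercase letters, numbers, and single hyphens; starting and ending with a letter or number), and that the source path is an existing directory. It does not verify that the SAS token actually grants write access.

```python
import os
import re
from dataclasses import dataclass
from typing import List

# Lowercase letters and digits, with single hyphens allowed only between them.
_CONTAINER_NAME = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


@dataclass
class LoadInputs:
    """Hypothetical holder for the four values listed above."""
    storage_account: str
    container: str
    sas_token: str
    source_path: str


def find_problems(inputs: LoadInputs) -> List[str]:
    """Return a list of locally detectable problems (empty if none found)."""
    problems: List[str] = []
    if not inputs.storage_account:
        problems.append("storage account is missing")
    name = inputs.container
    if not (3 <= len(name) <= 63 and _CONTAINER_NAME.fullmatch(name)):
        problems.append("container name is not a valid blob container name")
    if not inputs.sas_token:
        problems.append("SAS token is missing")
    if not os.path.isdir(inputs.source_path):
        problems.append("source path is not an existing directory")
    return problems
```

An empty list means only that nothing was found locally; follow the instructions in the Avere CLFSLoad distribution for the actual transfer.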
Copy data through the Azure HPC Cache
If you don't want to use the Avere CLFSLoad utility, or if you want to add a large amount of data to an existing blob storage target, you can copy it through the cache. Azure HPC Cache is designed to serve multiple clients simultaneously, so to copy data through the cache, you should use parallel writes from multiple clients.
The copy commands that you typically use to transfer data from one storage system to another are single-threaded processes that copy only one file at a time. This means that the file server is ingesting only one file at a time, which wastes the cache's resources.
This section explains strategies for creating a multi-client, multi-threaded file copying system to move data to blob storage with Azure HPC Cache. It explains file transfer concepts and decision points that can be used for efficient data copying using multiple clients and simple copy commands.
It also explains some utilities that can help. The msrsync utility can be used to partially automate the process of dividing a dataset into buckets and using rsync commands. The parallelcp script is another utility that reads the source directory and issues copy commands automatically.
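To make the idea of dividing a dataset into buckets concrete, here is a minimal Python sketch. It is not the algorithm that msrsync or parallelcp uses; it is a simple greedy split (largest file first, always into the bucket with the fewest bytes so far) chosen for illustration, and the function names scan_tree and split_into_buckets are hypothetical.

```python
import heapq
import os
from typing import Dict, List, Tuple


def scan_tree(root: str) -> Dict[str, int]:
    """Map each regular file under root (as a relative path) to its size in bytes."""
    sizes: Dict[str, int] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def split_into_buckets(sizes: Dict[str, int], bucket_count: int) -> List[List[str]]:
    """Greedy split: place each file, largest first, into the lightest bucket."""
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")
    # Heap entries are (total_bytes, bucket_index), so the lightest bucket pops first.
    heap: List[Tuple[int, int]] = [(0, index) for index in range(bucket_count)]
    heapq.heapify(heap)
    buckets: List[List[str]] = [[] for _ in range(bucket_count)]
    # Sort by descending size, then by path so the result is deterministic.
    for path, size in sorted(sizes.items(), key=lambda item: (-item[1], item[0])):
        total, index = heapq.heappop(heap)
        buckets[index].append(path)
        heapq.heappush(heap, (total + size, index))
    return buckets
```

Each bucket can then be handed to a separate copy process or client. Balancing by bytes suits large files; for many small files, balancing by file count might be the better choice.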
When building a strategy to copy data in parallel, you should understand the tradeoffs in file size, file count, and directory depth.
- When files are small, the metric of interest is files per second.
- When files are large (10 MiB or greater), the metric of interest is bytes per second.
Each copy process has a throughput rate and a files-transferred rate, which you can measure by timing the copy command and factoring in the file size and file count. Explaining how to measure the rates is outside the scope of this document, but it is essential to understand whether you'll be dealing with small or large files.
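As a rough illustration only, the two rates can be derived from a file count, a byte total, and an elapsed time. The Python sketch below times a single-threaded copy with shutil.copytree from the standard library; the names CopyRates and timed_copy are hypothetical, and using the average file size to choose between the two metrics is an assumption made for this example. Only the 10 MiB threshold comes from the text above.

```python
import os
import shutil
import time
from dataclasses import dataclass

LARGE_FILE_BYTES = 10 * 1024 * 1024  # 10 MiB, the "large file" threshold above


@dataclass
class CopyRates:
    file_count: int
    total_bytes: int
    seconds: float

    def __post_init__(self) -> None:
        if self.seconds <= 0:
            raise ValueError("seconds must be greater than zero")

    @property
    def files_per_second(self) -> float:
        return self.file_count / self.seconds

    @property
    def bytes_per_second(self) -> float:
        return self.total_bytes / self.seconds

    @property
    def primary_metric(self) -> str:
        # Assumption: judge "small" or "large" by the average file size.
        average = self.total_bytes / self.file_count if self.file_count else 0
        return "bytes per second" if average >= LARGE_FILE_BYTES else "files per second"


def timed_copy(source_dir: str, dest_dir: str) -> CopyRates:
    """Count the source files, then time one single-threaded copy of the tree."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    start = time.perf_counter()
    shutil.copytree(source_dir, dest_dir)  # dest_dir must not exist yet
    seconds = time.perf_counter() - start
    return CopyRates(file_count, total_bytes, seconds)
```

Running timed_copy on a representative sample of the dataset gives a single-process baseline to compare against when you add parallel copies.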
Strategies for parallel data ingest with Azure HPC Cache include:
- Manual copying - You can manually create a multi-threaded copy on a client by running more than one copy command at once in the background against predefined sets of files or paths. Read Azure HPC Cache data ingest - manual copy method for details.
- Partially automated copying with msrsync - msrsync is a wrapper utility that runs multiple parallel rsync processes. For details, read Azure HPC Cache data ingest - msrsync method.
- Scripted copying with parallelcp - Learn how to create and run a parallel copy script in Azure HPC Cache data ingest - parallel copy script method.
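The common idea behind these strategies, running several copies at once against predefined sets of paths, can be sketched in Python. This is not the method described in the linked articles and is not the parallelcp script; it is a minimal stand-in that uses a thread pool and shutil.copytree from the standard library. The function names, the worker count, and the assumption that the dataset is already divided into top-level subdirectories are all illustrative. The dirs_exist_ok argument requires Python 3.8 or later.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from typing import List


def copy_subtree(source_root: str, dest_root: str, subdir: str) -> str:
    """Copy one predefined subdirectory; returns its name when finished."""
    shutil.copytree(
        os.path.join(source_root, subdir),
        os.path.join(dest_root, subdir),
        dirs_exist_ok=True,  # Python 3.8+; allows copying into an existing tree
    )
    return subdir


def parallel_copy(source_root: str, dest_root: str,
                  subdirs: List[str], workers: int = 4) -> List[str]:
    """Run one copy per subdirectory, at most `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(copy_subtree, source_root, dest_root, subdir)
                   for subdir in subdirs]
        # result() re-raises any exception from a failed copy.
        return [future.result() for future in futures]
```

In this sketch, dest_root would be a path on a client's mount of the cache. Running the same kind of copy from several clients, each with its own set of subdirectories, spreads the load further.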
After you set up your storage, learn how clients can mount the cache.