Transfer data to and from Azure

There are several options for transferring data to and from Azure, depending on your needs.

Physical transfer

Using physical hardware to transfer data to Azure is a good option when:

  • Your network is slow or unreliable.
  • Getting more network bandwidth is cost-prohibitive.
  • Security or organizational policies don't allow outbound connections when dealing with sensitive data.

If your primary concern is how long it takes to transfer your data, you might want to run a test to verify whether network transfer is slower than physical transport.

There are two main options for physically transporting data to Azure:

The Azure Import/Export service

The Azure Import/Export service lets you securely transfer large amounts of data to Azure Blob Storage or Azure Files by shipping internal SATA HDDs or SSDs to an Azure datacenter. You can also use this service to transfer data from Azure Storage to hard disk drives and have the drives shipped to you for loading on-premises.

Azure Data Box

Azure Data Box is a Microsoft-provided appliance that works much like the Import/Export service. With Data Box, Microsoft ships you a proprietary, secure, and tamper-resistant transfer appliance and handles the end-to-end logistics, which you can track through the portal. One benefit of the Data Box service is ease of use: you don't need to purchase several hard drives, prepare them, and transfer files to each one. Many industry-leading Azure partners also support Data Box, which makes it easier to use offline transport to the cloud from their products.

Command-line tools and APIs

Consider these options when you want scripted and programmatic data transfer; example commands for several of these tools appear after the list:

  • The Azure CLI is a cross-platform tool that allows you to manage Azure services and upload data to Storage.

  • AzCopy. Use AzCopy from a Windows or Linux command line to easily copy data to and from Blob Storage, Azure File Storage, and Azure Table Storage with optimal performance. AzCopy supports concurrency and parallelism, and can resume copy operations when they're interrupted. You can also use AzCopy to copy data from AWS to Azure. For programmatic access, the Microsoft Azure Storage Data Movement Library is the core framework that powers AzCopy. It's provided as a .NET Core library.

  • With PowerShell, the Start-AzureStorageBlobCopy cmdlet is an option for Windows administrators who are used to PowerShell.

  • AdlCopy enables you to copy data from Blob Storage into Azure Data Lake Storage. It can also be used to copy data between two Data Lake Storage accounts. However, it can't be used to copy data from Data Lake Storage to Blob Storage.

  • Distcp copies data between HDInsight cluster storage (WASB) and a Data Lake Storage account.

  • Sqoop is an Apache project and part of the Hadoop ecosystem. It comes preinstalled on all HDInsight clusters and allows data transfer between an HDInsight cluster and relational databases such as SQL Server, Oracle, and MySQL. Sqoop is a collection of related tools, including import and export tools. Sqoop works with HDInsight clusters that use either Blob Storage or Data Lake Storage attached storage.

  • PolyBase is a technology that accesses data outside a database through the T-SQL language. In SQL Server 2016, it allows you to run queries on external data in Hadoop or to import or export data from Blob Storage. In Azure Synapse Analytics, you can import or export data from Blob Storage and Data Lake Storage. Currently, PolyBase is the fastest method of importing data into Azure Synapse Analytics.

  • Use the Hadoop command line when you have data that resides on an HDInsight cluster head node. You can use the hadoop fs -copyFromLocal command to copy that data to your cluster's attached storage, such as Blob Storage or Data Lake Storage. To use the Hadoop command line, you must first connect to the head node. Once connected, you can upload a file to storage.
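
As a minimal sketch, an Azure CLI upload of a single local file to Blob Storage might look like the following. The storage account, container, and file names are placeholders; adjust them for your environment.

    # Upload one local file to a blob container (placeholder names)
    az storage blob upload \
        --account-name mystorageaccount \
        --container-name mycontainer \
        --name data/sales.csv \
        --file ./sales.csv \
        --auth-mode login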
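
A minimal AzCopy sketch for copying a local folder to Blob Storage might look like this. It assumes AzCopy v10 syntax, and the account name, container, SAS token, and job ID are placeholders.

    # Recursively copy a local folder into a blob container by using a SAS token (placeholders)
    azcopy copy "./data" "https://mystorageaccount.blob.core.windows.net/mycontainer?<SAS-token>" --recursive

    # If the transfer is interrupted, resume it by job ID
    azcopy jobs resume <job-id>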
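
For PowerShell, a sketch of a server-side blob copy with Start-AzureStorageBlobCopy might look like the following; the account name, account key, container names, and blob names are placeholders.

    # Build a storage context and start a server-side copy between containers (placeholders)
    $ctx = New-AzureStorageContext -StorageAccountName "mystorageaccount" -StorageAccountKey "<account-key>"
    Start-AzureStorageBlobCopy -SrcContainer "source-container" -SrcBlob "sales.csv" `
        -DestContainer "dest-container" -DestBlob "sales.csv" -Context $ctx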
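
On an HDInsight head node, the Hadoop command line and Distcp usage might look like the following sketch. The container, storage account, and filesystem names are placeholders, and the abfss:// scheme assumes the target is a Data Lake Storage Gen2 account.

    # Copy a local file on the head node to the cluster's attached Blob Storage (placeholders)
    hadoop fs -copyFromLocal data.csv wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data/

    # Copy between cluster storage (WASB) and a Data Lake Storage Gen2 account with Distcp
    hadoop distcp \
        wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data \
        abfss://myfilesystem@mydatalake.dfs.core.windows.net/data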
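
A minimal Sqoop sketch for importing a relational table into the cluster's attached Blob Storage might look like this; the server, database, credentials, table name, and target path are placeholders.

    # Import a SQL Server table into attached Blob Storage (placeholders)
    sqoop import \
        --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb" \
        --username sqluser --password '<password>' \
        --table Sales \
        --target-dir wasbs://mycontainer@mystorageaccount.blob.core.windows.net/sales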

Graphical interface

Consider the following options if you're only transferring a few files or data objects and don't need to automate the process.

  • Azure Storage Explorer is a cross-platform tool that lets you manage the contents of your Azure storage accounts. It allows you to upload, download, and manage blobs, files, queues, tables, and Azure Cosmos DB entities. Use it with Blob Storage to manage blobs and folders, and upload and download blobs between your local file system and Blob Storage, or between storage accounts.

  • Azure portal. Both Blob Storage and Data Lake Storage provide a web-based interface for exploring files and uploading new files. This is a good option when you want to quickly explore your files or upload a handful of new ones without installing tools or issuing commands.

Data sync and pipelines

  • Azure Data Factory is a managed service best suited for regularly transferring files between many Azure services, on-premises systems, or a combination of the two. By using Data Factory, you can create and schedule data-driven workflows called pipelines that ingest data from disparate data stores, and you can orchestrate and automate data movement and data transformation. Data Factory can process and transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning. A sketch of triggering a pipeline run appears after this list.

  • Pipelines and activities in Data Factory and Azure Synapse Analytics can be used to construct end-to-end data-driven workflows for your data movement and data processing scenarios. Additionally, the Azure Data Factory integration runtime is used to provide data integration capabilities across different network environments.

  • Azure Data Box Gateway transfers data to and from Azure, but it's a virtual appliance, not a hard drive. Virtual machines residing in your on-premises network write data to Data Box Gateway by using the NFS and SMB protocols. The device then transfers your data to Azure.
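
As a sketch of how a Data Factory transfer might be started on demand, the following PowerShell command triggers a run of an existing pipeline. The resource group, factory, and pipeline names are placeholders, and the pipeline itself (for example, a copy activity between two data stores) is assumed to already exist. The cmdlet returns a run ID that you can use to monitor the run.

    # Trigger a run of an existing pipeline in Azure Data Factory (placeholder names)
    Invoke-AzDataFactoryV2Pipeline `
        -ResourceGroupName "my-resource-group" `
        -DataFactoryName "my-data-factory" `
        -PipelineName "CopyBlobToSqlPipeline"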

Key selection criteria

For data transfer scenarios, choose the appropriate system for your needs by answering these questions:

  • Do you need to transfer large amounts of data, where doing so over an internet connection would take too long, be unreliable, or cost too much? If yes, consider physical transfer.

  • Do you prefer to script your data transfer tasks, so they're reusable? If so, select one of the command-line options or Data Factory.

  • Do you need to transfer a large amount of data over a network connection? If so, select an option that's optimized for big data.

  • Do you need to transfer data to or from a relational database? If yes, choose an option that supports one or more relational databases. Some of these options also require a Hadoop cluster.

  • Do you need an automated data pipeline or workflow orchestration? If yes, consider Data Factory.

Capability matrix

The following tables summarize the key differences in capabilities.

Physical transfer

| Capability | The Import/Export service | Data Box |
| --- | --- | --- |
| Form factor | Internal SATA HDDs or SSDs | Secure, tamper-proof, single hardware appliance |
| Microsoft manages shipping logistics | No | Yes |
| Integrates with partner products | No | Yes |
| Custom appliance | No | Yes |

Command-line tools

Hadoop/HDInsight:

| Capability | Distcp | Sqoop | Hadoop CLI |
| --- | --- | --- | --- |
| Optimized for big data | Yes | Yes | Yes |
| Copy to relational database | No | Yes | No |
| Copy from relational database | No | Yes | No |
| Copy to Blob Storage | Yes | Yes | Yes |
| Copy from Blob Storage | Yes | Yes | No |
| Copy to Data Lake Storage | Yes | Yes | Yes |
| Copy from Data Lake Storage | Yes | Yes | No |

Other:

| Capability | Azure CLI | AzCopy | PowerShell | AdlCopy | PolyBase |
| --- | --- | --- | --- | --- | --- |
| Compatible platforms | Linux, OS X, Windows | Linux, Windows | Windows | Linux, OS X, Windows | SQL Server, Azure Synapse Analytics |
| Optimized for big data | No | Yes | No | Yes [1] | Yes [2] |
| Copy to relational database | No | No | No | No | Yes |
| Copy from relational database | No | No | No | No | Yes |
| Copy to Blob Storage | Yes | Yes | Yes | No | Yes |
| Copy from Blob Storage | Yes | Yes | Yes | Yes | Yes |
| Copy to Data Lake Storage | No | Yes | Yes | Yes | Yes |
| Copy from Data Lake Storage | No | No | Yes | Yes | Yes |

[1] AdlCopy is optimized for transferring big data when used with a Data Lake Analytics account.

[2] PolyBase performance can be increased by pushing computation to Hadoop and using PolyBase scale-out groups to enable parallel data transfer between SQL Server instances and Hadoop nodes.

Graphical interfaces, data sync, and data pipelines

| Capability | Azure Storage Explorer | Azure portal * | Data Factory | Data Box Gateway |
| --- | --- | --- | --- | --- |
| Optimized for big data | No | No | Yes | Yes |
| Copy to relational database | No | No | Yes | No |
| Copy from relational database | No | No | Yes | No |
| Copy to Blob Storage | Yes | No | Yes | Yes |
| Copy from Blob Storage | Yes | No | Yes | No |
| Copy to Data Lake Storage | No | No | Yes | No |
| Copy from Data Lake Storage | No | No | Yes | No |
| Upload to Blob Storage | Yes | Yes | Yes | Yes |
| Upload to Data Lake Storage | Yes | Yes | Yes | Yes |
| Orchestrate data transfers | No | No | Yes | No |
| Custom data transformations | No | No | Yes | No |
| Pricing model | Free | Free | Pay per usage | Pay per unit |

* Azure portal in this case represents the web-based exploration tools for Blob Storage and Data Lake Storage.
