Migrate on-premises Apache Hadoop clusters to Azure HDInsight - data migration best practices

This article gives recommendations for data migration to Azure HDInsight. It's part of a series that provides best practices to assist with migrating on-premises Apache Hadoop systems to Azure HDInsight.

Migrate data from on-premises to Azure

There are two main options for migrating data from on-premises to the Azure environment:

  1. Transfer data over the network with TLS
    1. Over the internet
    2. ExpressRoute
  2. Ship the data
    1. Import / Export service
      • Internal SATA HDDs or SSDs only
      • Encrypted at rest (AES-128 / AES-256)
      • Import job can have up to 10 disks
      • Generally available (GA) in all public regions
    2. Data Box
      • Up to 80 TB of data per Data Box
      • Encrypted at rest (AES-256)
      • Uses NAS protocols and supports common data copy tools
      • Ruggedized hardware
      • In public preview, available in the US only

The following table shows approximate data transfer durations based on data volume and network bandwidth. Use a Data Box if the data migration is expected to take more than three weeks.

| Data quantity | 45 Mbps (T3) | 100 Mbps | 1 Gbps |
|---|---|---|---|
| 1 TB | 2 days | 1 day | 2 hours |
| 10 TB | 22 days | 10 days | 1 day |
| 35 TB | 76 days | 34 days | 3 days |
| 80 TB | 173 days | 78 days | 8 days |
| 100 TB | 216 days | 97 days | 10 days |
| 200 TB | 1 year | 194 days | 19 days |
| 500 TB | 3 years | 1 year | 49 days |
| 1 PB | 6 years | 3 years | 97 days |
| 2 PB | 12 years | 5 years | 194 days |
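For data volumes or bandwidths that the table doesn't list, a rough estimate can be sketched as follows. This is a minimal illustration, not an official calculator: the function names `transfer_hours` and `recommend_transfer` are hypothetical, 1 TB is assumed to be 10^12 bytes, and the link is assumed to be fully used with no protocol overhead. The results are therefore a lower bound and come out somewhat shorter than the table's figures.

```shell
# Idealized transfer time in whole hours (rounded down).
# $1 = data volume in TB (assumed 10^12 bytes), $2 = bandwidth in Mbps.
# seconds = (TB * 8 * 10^12 bits) / (Mbps * 10^6 bits per second)
transfer_hours() {
  echo $(( $1 * 8000000 / $2 / 3600 ))
}

# Three weeks: the point past which the article suggests a Data Box.
THRESHOLD_HOURS=$(( 21 * 24 ))

# Print a rough recommendation for a given volume and bandwidth.
recommend_transfer() {
  hours=$(transfer_hours "$1" "$2")
  if [ "$hours" -gt "$THRESHOLD_HOURS" ]; then
    echo "$1 TB at $2 Mbps: about $hours hours, consider a Data Box"
  else
    echo "$1 TB at $2 Mbps: about $hours hours, network transfer looks feasible"
  fi
}

recommend_transfer 1 1000
recommend_transfer 35 45
```

Measure the bandwidth that is actually sustained between the cluster and Azure before relying on an estimate like this one.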

Tools such as Apache Hadoop DistCp, Azure Data Factory, and AzureCp can be used to transfer data over the network. The third-party tool WANDisco can also be used for the same purpose. Kafka MirrorMaker and Sqoop can be used for ongoing data transfer from on-premises to Azure storage systems.

Performance considerations when using Apache DistCp

DistCp is an Apache project that uses a MapReduce map job to transfer data, handle errors, and recover from those errors. It assigns a list of source files to each map task. The map task then copies all of its assigned files to the destination. Several techniques can improve the performance of DistCp.

Increase the number of Mappers

DistCp tries to create map tasks so that each one copies roughly the same number of bytes. By default, DistCp jobs use 20 mappers. Using more mappers for DistCp (with the -m parameter on the command line) increases parallelism during the data transfer and shortens the transfer time. However, there are two things to consider when increasing the number of mappers:

  1. DistCp's lowest granularity is a single file. Specifying a number of Mappers more than the number of source files does not help and will waste the available cluster resources.
  2. Consider the available YARN memory on the cluster to determine the number of mappers. Each map task is launched as a YARN container. Assuming that no other heavy workloads are running on the cluster, the number of mappers can be determined by the following formula: m = (number of worker nodes * YARN memory for each worker node) / YARN container size. However, if other applications are using memory, use only a portion of the YARN memory for DistCp jobs.
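The two considerations above can be combined into a small calculation. The following is a sketch under stated assumptions: `max_mappers` is a hypothetical helper, and the cluster figures in the usage lines (4 worker nodes, 98304 MB of YARN memory per node, 3072 MB containers) are made-up example values, not recommendations.

```shell
# max_mappers NODES YARN_MB_PER_NODE CONTAINER_MB [PERCENT] [FILE_COUNT]
# Applies m = (nodes * YARN memory per node) / container size, optionally
# scaled to PERCENT of the cluster memory and capped at FILE_COUNT, because
# DistCp cannot split a single file across mappers.
max_mappers() {
  percent=${4:-100}
  mappers=$(( $1 * $2 * percent / 100 / $3 ))
  if [ -n "$5" ] && [ "$mappers" -gt "$5" ]; then
    mappers=$5
  fi
  echo "$mappers"
}

max_mappers 4 98304 3072          # whole cluster available: prints 128
max_mappers 4 98304 3072 50       # half the memory for DistCp: prints 64
max_mappers 4 98304 3072 100 40   # only 40 source files: prints 40
```

The result is the value to pass to the -m parameter.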

Use more than one DistCp job

When the size of the dataset to be moved is larger than 1 TB, use more than one DistCp job. Using more than one job limits the impact of failures. If any job fails, you only need to restart that specific job rather than all of the jobs.
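One way to split a migration into several jobs is to run one DistCp job per top-level subdirectory. The sketch below only prints the commands so that they can be reviewed first. The function name `plan_distcp_jobs`, the subdirectory names, and the source and destination roots are illustrative assumptions; the -m and -strategy options are the ones described in this article.

```shell
# Hypothetical source and destination roots; replace with real values.
SRC_ROOT="hdfs://nn1:8020/foo"
DEST_ROOT="wasb://<container_name>@<storage_account_name>.blob.core.windows.net/foo"

# Print (do not run) one DistCp command per subdirectory name given.
# If one job fails, only that subdirectory needs to be copied again.
plan_distcp_jobs() {
  for dir in "$@"; do
    echo "hadoop distcp -m 100 -strategy dynamic $SRC_ROOT/$dir $DEST_ROOT/$dir"
  done
}

# Example subdirectory names (assumed for illustration).
plan_distcp_jobs bar baz qux
```

After reviewing the printed commands, each one can be run and monitored separately.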

Consider splitting files

If there are a small number of large files, then consider splitting them into 256-MB file chunks to get more potential concurrency with more Mappers.

Use the 'strategy' command-line parameter

Consider passing -strategy dynamic on the command line. The default value of the strategy parameter is uniformsize, in which case each map task copies roughly the same number of bytes. When this parameter is changed to dynamic, the listing file is split into several "chunk-files". The number of chunk-files is a multiple of the number of map tasks. Each map task is assigned one of the chunk-files. After all the paths in a chunk are processed, the current chunk is deleted and a new chunk is acquired. The process continues until no more chunks are available. This "dynamic" approach allows faster map tasks to consume more paths than slower ones, which speeds up the DistCp job overall.

Increase the number of threads

See if increasing the -numListstatusThreads parameter improves performance. This parameter controls the number of threads used to build the file listing. The maximum value is 40.

Use the output committer algorithm

See if passing the parameter -Dmapreduce.fileoutputcommitter.algorithm.version=2 improves DistCp performance. This output committer algorithm has optimizations around writing output files to the destination. The following command is an example that shows the usage of different parameters:

hadoop distcp -Dmapreduce.fileoutputcommitter.algorithm.version=2 -numListstatusThreads 30 -m 100 -strategy dynamic hdfs://nn1:8020/foo/bar wasb://<container_name>@<storage_account_name>.blob.core.windows.net/foo/

Metadata migration


Apache Hive

The Hive metastore can be migrated either by using scripts or by using database replication.

Hive metastore migration using scripts

  1. Generate the Hive DDLs from the on-premises Hive metastore. This step can be done using a [wrapper bash script](https://github.com/hdinsight/hdinsight.github.io/blob/master/hive/hive-export-import-metastore.md).
  2. Edit the generated DDL to replace HDFS URLs with WASB/ADLS/ABFS URLs
  3. Run the updated DDL on the metastore from the HDInsight cluster
  4. Make sure that the Hive metastore version is compatible between on-premises and cloud
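The URL replacement in step 2 can be scripted. The following is a minimal sketch that uses sed. The function name `rewrite_ddl_locations` and the sample DDL statement are hypothetical, and the old and new roots are placeholder values that mirror the examples in this article. The old root is used as a regular expression, so a root that contains special characters would need escaping.

```shell
# Placeholder roots; replace with the real NameNode address and storage URL.
OLD_ROOT="hdfs://nn1:8020"
NEW_ROOT="wasb://<container_name>@<storage_account_name>.blob.core.windows.net"

# Read DDL on stdin and write it to stdout with every occurrence of the
# old root replaced by the new root. '|' is the sed delimiter because the
# URLs contain '/'.
rewrite_ddl_locations() {
  sed "s|$OLD_ROOT|$NEW_ROOT|g"
}

# Sample statement (assumed for illustration).
printf '%s\n' "CREATE EXTERNAL TABLE t (id INT) LOCATION 'hdfs://nn1:8020/foo/bar';" \
  | rewrite_ddl_locations
```

In practice, the function would read the generated DDL file and write a new file. Review the new file before running it in step 3.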

Hive metastore migration using DB replication

  • Set up Database Replication between on-premises Hive metastore DB and HDInsight metastore DB
  • Use the "Hive MetaTool" to replace HDFS URLs with WASB/ADLS/ABFS URLs. The -updateLocation option takes the new location first and the old location second, for example:
./hive --service metatool -updateLocation wasb://<container_name>@<storage_account_name>.blob.core.windows.net/ hdfs://nn1:8020/


Apache Ranger

  • Export on-premises Ranger policies to XML files.
  • Transform on-premises-specific HDFS-based paths to WASB/ADLS by using a tool like XSLT.
  • Import the policies into Ranger running on HDInsight.

Next steps

Read the next article in this series: