Migrate on-premises Apache Hadoop clusters to Azure HDInsight - motivation and benefits

12/08/2023

This article is the first in a series on best practices for migrating on-premises Apache Hadoop ecosystem deployments to Azure HDInsight. The series is for people who are responsible for the design, deployment, and migration of Apache Hadoop solutions in Azure HDInsight. Roles that may benefit from these articles include cloud architects, Hadoop administrators, and DevOps engineers. Software developers, data engineers, and data scientists should also benefit from the explanation of how different types of clusters work in the cloud.

Why migrate to Azure HDInsight

Azure HDInsight is a cloud distribution of Hadoop components. Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. HDInsight includes the most popular open-source frameworks such as:

Apache Hadoop
Apache Spark
Apache Hive with LLAP
Apache Kafka
Apache HBase

Azure HDInsight advantages over on-premises Hadoop

Low cost - Costs can be reduced by creating clusters on demand and paying only for what you use. Decoupled compute and storage provide flexibility by keeping the data volume independent of the cluster size.
Automated cluster creation - Automated cluster creation requires minimal setup and configuration. Automation can be used for on-demand clusters.
Managed hardware and configuration - There's no need to worry about the physical hardware or infrastructure with an HDInsight cluster. Just specify the configuration of the cluster, and Azure sets it up.
Easily scalable - HDInsight enables you to scale workloads up or down. Azure takes care of data redistribution and workload rebalancing without interrupting data processing jobs.
Global availability - HDInsight is available in more regions than any other big data analytics offering. Azure HDInsight is also available in Azure Government, China, and Germany, which allows you to meet your enterprise needs in key sovereign areas.
Secure and compliant - HDInsight enables you to protect your enterprise data assets with Azure Virtual Network, encryption, and integration with Microsoft Entra ID. HDInsight also meets the most popular industry and government compliance standards.
Simplified version management - Azure HDInsight manages the versions of Hadoop ecosystem components and keeps them up to date. By contrast, software updates are usually a complex process for on-premises deployments.
Smaller clusters optimized for specific workloads with fewer dependencies between components - A typical on-premises Hadoop setup uses a single cluster that serves many purposes. With Azure HDInsight, workload-specific clusters can be created. Creating clusters for specific workloads removes the burden of maintaining a single cluster that grows ever more complex.
Productivity - You can use various tools for Hadoop and Spark in your preferred development environment.
Extensibility with custom tools or third-party applications - HDInsight clusters can be extended with installed components and can also be integrated with other big data solutions by using one-click deployments from Azure Marketplace.
Easy management, administration, and monitoring - Azure HDInsight integrates with Azure Monitor logs to provide a single interface with which you can monitor all your clusters.
Integration with other Azure services - HDInsight can easily be integrated with other popular Azure services such as the following:
- Azure Data Factory (ADF)
- Azure Blob Storage
- Azure Data Lake Storage Gen2
- Azure Cosmos DB
- Azure SQL Database
- Azure Analysis Services
Self-healing processes and components - HDInsight constantly checks the infrastructure and open-source components using its own monitoring infrastructure. It also automatically recovers from critical failures, such as the unavailability of open-source components and nodes. Alerts are triggered in Ambari if any OSS component fails.

For more information, see the article "What is Azure HDInsight and the Apache Hadoop technology stack".

Migration planning process

The following steps are recommended for planning a migration of on-premises Hadoop clusters to Azure HDInsight:

Understand the current on-premises deployment and topologies.
Understand the current project scope, timelines, and team expertise.
Understand the Azure requirements.
Build out a detailed plan based on best practices.

Gathering details to prepare for a migration

This section provides template questionnaires to help gather important information about:

The on-premises deployment
Project details
Azure requirements

On-premises deployment questionnaire

Question	Example	Answer
Topic: Environment
Cluster Distribution version	HDP 2.6.5, CDH 5.7
Big Data ecosystem components	HDFS, YARN, Hive, LLAP, Impala, Kudu, HBase, Spark, MapReduce, Kafka, ZooKeeper, Solr, Sqoop, Oozie, Ranger, Atlas, Falcon, Zeppelin, R
Cluster types	Hadoop, Spark, Confluent Kafka, Solr
Number of clusters	4
Number of master nodes	2
Number of worker nodes	100
Number of edge nodes	5
Total Disk space	100 TB
Master Node configuration	m/y, cpu, disk, etc.
Data Nodes configuration	m/y, cpu, disk, etc.
Edge Nodes configuration	m/y, cpu, disk, etc.
HDFS Encryption?	Yes
High Availability	HDFS HA, Metastore HA
Disaster Recovery / Backup	Backup cluster?
Systems that are dependent on Cluster	SQL Server, Teradata, Power BI, MongoDB
Third-party integrations	Tableau, GridGain, Qubole, Informatica, Splunk
Topic: Security
Perimeter security	Firewalls
Cluster authentication & authorization	Active Directory, Ambari, Cloudera Manager, No authentication
HDFS Access Control	Manual, ssh users
Hive authentication & authorization	Sentry, LDAP, AD with Kerberos, Ranger
Auditing	Ambari, Cloudera Navigator, Ranger
Monitoring	Graphite, collectd, statsd, Telegraf, InfluxDB
Alerting	Kapacitor, Prometheus, Datadog
Data Retention duration	Three years, five years
Cluster Administrators	Single Administrator, Multiple Administrators
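
As the answers come in, it can help to keep the questionnaire in a machine-readable form so that open questions are easy to spot. The following is a minimal sketch, assuming the questionnaire is stored as tab-separated text with the same Question, Example, and Answer columns and "Topic:" rows used above. The `QuestionnaireRow`, `parse_questionnaire`, and `unanswered` names are hypothetical and are not part of any HDInsight tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuestionnaireRow:
    topic: str
    question: str
    example: str
    answer: str = ""


def parse_questionnaire(text: str) -> List[QuestionnaireRow]:
    """Parse tab-separated questionnaire lines into rows grouped by topic."""
    rows = []
    topic = ""
    for line in text.splitlines():
        if not line.strip() or line.startswith("Question\t"):
            continue  # skip blank lines and the header row
        if line.startswith("Topic:"):
            topic = line[len("Topic:"):].strip()
            continue
        # Pad so rows without an example or answer still unpack cleanly.
        question, example, answer = (line.split("\t") + ["", ""])[:3]
        rows.append(
            QuestionnaireRow(topic, question.strip(), example.strip(), answer.strip())
        )
    return rows


def unanswered(rows: List[QuestionnaireRow]) -> List[str]:
    """Return the questions that do not have an answer yet."""
    return [row.question for row in rows if not row.answer]


# Illustrative sample only; the third column holds made-up answers.
sample = (
    "Question\tExample\tAnswer\n"
    "Topic: Environment\n"
    "Number of clusters\t4\t3\n"
    "Number of worker nodes\t100\n"
    "Topic: Security\n"
    "Perimeter security\tFirewalls\tFirewalls\n"
)
rows = parse_questionnaire(sample)
print(unanswered(rows))
```

The same parser could be pointed at the project details and Azure requirements questionnaires, since they share the column layout.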

Project details questionnaire

Question	Example	Answer
Topic: Workloads and Frequency
MapReduce jobs	10 jobs--twice daily
Hive jobs	100 jobs--every hour
Spark batch jobs	50 jobs--every 15 minutes
Spark Streaming jobs	5 jobs--every 3 minutes
Structured Streaming jobs	5 jobs--every minute
Programming Languages	Python, Scala, Java
Scripting	Shell, Python
Topic: Data
Data sources	Flat files, JSON, Kafka, RDBMS
Data orchestration	Oozie workflows, Airflow
In memory lookups	Apache Ignite, Redis
Data destinations	HDFS, RDBMS, Kafka, MPP
Topic: Metadata
Hive DB type	MySQL, PostgreSQL
Number of Hive metastores	2
Number of Hive tables	100
Number of Ranger policies	20
Number of Oozie workflows	100
Topic: Scale
Data volume including Replication	100 TB
Daily ingestion volume	50 GB
Data growth rate	10% per year
Cluster Nodes growth rate	5% per year
Topic: Cluster utilization
Average CPU % used	60%
Average Memory % used	75%
Disk space used	75%
Average Network % used	25%
Topic: Staff
Number of Administrators	2
Number of Developers	10
Number of end users	100
Skills	Hadoop, Spark
Number of available resources for Migration efforts	2
Topic: Limitations
Current limitations	Latency is high
Current challenges	Concurrency issue
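
The growth rates in the Scale topic can be turned into a rough capacity projection. The following is a minimal sketch that applies compound annual growth to the example values from the questionnaires (100 TB at 10% per year, 100 worker nodes at 5% per year). The three-year planning horizon and the `project_growth` helper are assumptions for illustration, not values or functions from this article.

```python
import math


def project_growth(current: float, annual_rate: float, years: int) -> float:
    """Compound `current` by `annual_rate` (0.10 means 10%) for `years` years."""
    return current * (1 + annual_rate) ** years


# Example values from the questionnaires; replace them with your own answers.
data_volume_tb = 100      # Data volume including replication
data_growth_rate = 0.10   # Data growth rate per year
worker_nodes = 100        # Number of worker nodes
node_growth_rate = 0.05   # Cluster nodes growth rate per year
horizon_years = 3         # Assumption: planning horizon, not from the article

projected_tb = project_growth(data_volume_tb, data_growth_rate, horizon_years)
projected_nodes = math.ceil(
    project_growth(worker_nodes, node_growth_rate, horizon_years)
)

print(f"Data volume after {horizon_years} years: {projected_tb:.1f} TB")
print(f"Worker nodes after {horizon_years} years: {projected_nodes}")
```

Because HDInsight decouples compute from storage, the two projections can be sized independently rather than scaled together.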

Azure requirements questionnaire

Question	Example	Answer
Topic: Infrastructure
Preferred Region	East US
VNet preferred?	Yes
HA / DR Needed?	Yes
Integration with other cloud services?	ADF, Azure Cosmos DB
Topic: Data Movement
Initial load preference	DistCp, Azure Data Box, ADF, WANDisco
Data transfer delta	DistCp, AzCopy
Ongoing incremental data transfer	DistCp, Sqoop
Topic: Monitoring & Alerting
Use Azure Monitoring & Alerting vs Integrate third-party monitoring	Use Azure Monitoring & Alerting
Topic: Security preferences
Private and protected data pipeline?	Yes
Domain Joined cluster (ESP)?	Yes
On-Premises AD Sync to Cloud?	Yes
Number of AD users to sync?	100
Ok to sync passwords to cloud?	Yes
Cloud only Users?	Yes
MFA needed?	No
Data authorization requirements?	Yes
Role-based access control?	Yes
Auditing needed?	Yes
Data encryption at rest?	Yes
Data encryption in transit?	Yes
Topic: Re-Architecture preferences
Single cluster vs Specific cluster types	Specific cluster types
Colocated Storage vs Remote Storage?	Remote Storage
Smaller cluster size as data is stored remotely?	Smaller cluster size
Use multiple smaller clusters rather than a single large cluster?	Use multiple smaller clusters
Use a remote metastore?	Yes
Share metastores between different clusters?	Yes
Deconstruct workloads?	Replace Hive jobs with Spark jobs
Use ADF for data orchestration?	No
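
If DistCp is the answer for the Data Movement questions, the copy commands can be generated from a script so that every dataset is transferred with the same options. The following is a minimal sketch that only builds the `hadoop distcp` argument list and does not run it. The NameNode host, container, storage account, and paths are hypothetical placeholders, and the sketch assumes the on-premises cluster already has the Azure storage connector, credentials, and network access needed to write to an `abfs://` destination.

```python
import shlex
from typing import List


def build_distcp_command(
    source: str, destination: str, max_maps: int = 20, update: bool = True
) -> List[str]:
    """Return the argument list for a `hadoop distcp` run (does not execute it)."""
    command = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ at the destination,
        # which suits delta and ongoing incremental transfers.
        command.append("-update")
    # -m caps the number of simultaneous map tasks used for the copy.
    command += ["-m", str(max_maps)]
    command += [source, destination]
    return command


# Hypothetical names: replace the NameNode host, container, and storage account.
source = "hdfs://onprem-namenode:8020/data/sales"
destination = "abfs://migrated@examplestorage.dfs.core.windows.net/data/sales"

command = build_distcp_command(source, destination, max_maps=50)
print(shlex.join(command))
```

Passing `update=False` yields a plain copy, which may be more appropriate for the initial load into an empty destination.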

Next steps

Read the next article in this series:

Architecture best practices for on-premises to Azure HDInsight Hadoop migration