Creating a Hadoop Development Cluster in Azure
In my previous post on Hadoop I showed how you could easily deploy a cluster to run on Azure. What was missing was a way to use the cluster efficiently. You could always remote desktop to the Job Tracker and kick off a job, but there are better ways.
This post is about actually using the cluster once it has been deployed to Azure. I chose the theme of a Development Cluster to justify making a few changes to how I previously configured the cluster and show some new techniques.
As a developer I expect easy access to the development cluster. The goal is to allow developers to safely connect to the cluster to deploy and debug their map/reduce jobs. SSH provides the necessary tools for this: a secure connection and tunneling. It not only lets developers establish a secure session with the cluster in Azure but also integrates with IDEs, which makes typical development tasks a breeze.
The Development Cluster
In this scenario of a development cluster I will use a single host to run both the Name Node and the Job Tracker. This is obviously not true for every development cluster but suffices for this demo. The number of slaves is initially set to 3. You can dynamically change the cluster size as I demonstrated in my previous post. If you are going to try it you might also want to adjust the VM size to meet your needs.
The procedure for deploying a Hadoop cluster has not changed, but the dependencies have. First is the Hadoop version: I had previously used 0.21, which many development tools do not support because it is an unstable release. I reverted to the stable line and ended up using 0.20.2. At the time of this writing 0.20.203.0rc1 was out but did not work on Windows. Next, Cygwin needs the OpenSSH package installed to provide the SSH Windows Service (instructions in the SSH post). Finally, there is YAJSW. It didn’t technically need to be updated; I just grabbed the latest drop, which is Beta-10.8.
Just follow the instructions from my previous post using the updated dependencies and grab this cluster configuration template and this Visual Studio 2010 project instead. You should have the following files in a container in your storage account:
You should be able to deploy your development cluster by publishing the HadoopAzure project directly from Visual Studio.
Connecting to the Cluster
A developer needs to connect to the cluster using SSH. I demonstrated how to do that using PuTTY; the only difference here is that we need to set up a couple of tunnels. This screen shows the two tunnels required to access the Name Node and Job Tracker.
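For readers who prefer a command line to the PuTTY dialog, the same two forwards can be expressed with the OpenSSH client's -L option. The sketch below only assembles that command line; it is not what the post itself uses. The ports 9000 and 9001, the user name hadoop and the host example.cloudapp.net are assumptions made for illustration, so substitute the values from your own cluster configuration.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds an OpenSSH command line that forwards local ports to the cluster master. */
class TunnelCommand {

    /** One local forward: localPort -> remoteHost:remotePort, resolved on the SSH server side. */
    static final class Forward {
        final int localPort;
        final String remoteHost;
        final int remotePort;

        Forward(int localPort, String remoteHost, int remotePort) {
            this.localPort = localPort;
            this.remoteHost = remoteHost;
            this.remotePort = remotePort;
        }

        /** Formats the forward the way ssh -L expects it. */
        String toArgument() {
            return localPort + ":" + remoteHost + ":" + remotePort;
        }
    }

    /** Returns the argument list for an ssh invocation that opens only the tunnels. */
    static List<String> build(String user, String host, int sshPort, List<Forward> forwards) {
        List<String> args = new ArrayList<>();
        args.add("ssh");
        args.add("-N"); // no remote command, just keep the forwards open
        args.add("-p");
        args.add(Integer.toString(sshPort));
        for (Forward forward : forwards) {
            args.add("-L");
            args.add(forward.toArgument());
        }
        args.add(user + "@" + host);
        return args;
    }

    public static void main(String[] argv) {
        List<Forward> forwards = new ArrayList<>();
        forwards.add(new Forward(9000, "localhost", 9000)); // assumed Name Node port
        forwards.add(new Forward(9001, "localhost", 9001)); // assumed Job Tracker port
        // User and host are hypothetical placeholders.
        List<String> command = build("hadoop", "example.cloudapp.net", 22, forwards);
        System.out.println(String.join(" ", command));
    }
}
```

Running main prints a single ssh command with one -L argument per tunnel. Because "localhost" is resolved on the SSH server, this layout assumes the Name Node and Job Tracker run on the host you connect to, as they do in this development cluster.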
Accessing the Cluster
With the tunnels open to the development cluster you can use it as if it were local.
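Before pointing an IDE or a job at the tunnels, it can help to confirm that the local ends are actually listening. Here is a minimal sketch using only the Java standard library; the ports 9000 and 9001 are assumptions and should match whatever you forwarded.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Checks whether the local ends of the SSH tunnels accept TCP connections. */
class TunnelCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolved: treat all as "not listening".
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {"Name Node", "Job Tracker"};
        int[] ports = {9000, 9001}; // assumed local tunnel ports
        for (int i = 0; i < ports.length; i++) {
            boolean up = isListening("localhost", ports[i], 2000);
            System.out.println(names[i] + " tunnel on port " + ports[i]
                    + (up ? " is listening" : " is not reachable"));
        }
    }
}
```

A successful connection only shows that the SSH client is listening locally. It does not prove that the Hadoop daemon behind the tunnel is answering, so treat it as a quick first check.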