Understanding WASB and Hadoop Storage in Azure
Yesterday we learned Why WASB Makes Hadoop on Azure So Very Cool. Now let's dive deeper into Windows Azure storage and WASB. I'll answer some of the common questions I get when people first try to understand how WASB is the same as and different from HDFS.
What is HDFS?
The Hadoop Distributed File System (HDFS) is one of the core Hadoop components; it is how Hadoop manages data and storage. At a high level, when you load a file into Hadoop, the "name node" uses HDFS to chunk the file into blocks and spreads those blocks of data across the worker nodes within the cluster. Each chunk of data is stored on multiple nodes (assuming the replication factor is set to greater than 1) for higher availability. The name node knows where each chunk of data is stored, and the job manager uses that information to allocate tasks and resources appropriately across nodes.
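To make the chunking and replication idea concrete, here is a minimal sketch in Python. It is not how HDFS is implemented: the 128 MB block size, the node names, and the round-robin placement are assumptions chosen for illustration (real HDFS placement is rack-aware).

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return the size in bytes of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a simplification; it only illustrates that every
    block ends up on several different worker nodes.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }

# Hypothetical 300 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"])

for block_id, holders in placement.items():
    print(block_id, blocks[block_id] // MB, "MB on", holders)
```

The 300 MB file becomes two full blocks plus one 44 MB block, and each block lands on three different nodes, so losing any single node leaves two copies of everything.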
What is WASB?
Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs. The WASBS variation encrypts traffic with SSL/TLS for improved security. In many ways it "is" HDFS. However, WASB creates a layer of abstraction that separates storage from compute. This separation is what enables your data to persist even when no clusters currently exist, and it enables multiple clusters plus other applications to access a single piece of data all at the same time. This increases functionality and flexibility while reducing costs and reducing the time from question to insight.
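In practice you address data through a WASB URI of the form wasb[s]://<container>@<account>.blob.core.windows.net/<path>. The sketch below builds and parses such URIs in Python; the helper names and the container and account names are made up for illustration.

```python
from urllib.parse import urlsplit

def wasb_uri(container, account, path="", secure=True):
    """Build a WASB URI; secure=True selects the wasbs scheme."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"

def parse_wasb_uri(uri):
    """Split a WASB URI into (container, account, path)."""
    parts = urlsplit(uri)
    if parts.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a WASB URI: {uri}")
    container = parts.username                 # the part before the '@'
    account = parts.hostname.split(".")[0]     # first label of the host name
    return container, account, parts.path.lstrip("/")

# "mycontainer" and "myaccount" are placeholder names.
uri = wasb_uri("mycontainer", "myaccount", "SomeDirectory/AFile.txt")
print(uri)
print(parse_wasb_uri(uri))
```

Because the container and storage account are part of the URI rather than part of the cluster, the same address works from any cluster or tool that has access to the account.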
What is an Azure blob store, an Azure storage account, and an Azure container? For that matter, what is Azure again?
Azure is Microsoft's cloud solution. A cloud is essentially a collection of hosted data centers that you don't have to directly manage. You can request services from that cloud. For example, you can request virtual machines and storage, data services such as SQL Azure Database or HDInsight, or services such as Websites or Service Bus. In Azure you store blobs in containers within Azure storage accounts. You grant access to a storage account, you create collections at the container level, and you place blobs (files of any format) inside the containers. This illustration from Microsoft's documentation helps to show the structure: a storage account holds containers, and each container holds blobs.
How do I manage and configure block/chunk size and the replication factor with WASB?
You don't. It's generally not necessary. The data is stored in the Azure storage accounts, where it remains accessible to many applications at once. Each blob (file) is replicated 3x within the data center. If you choose to use geo-replication on your account, you also get 3 copies of the data in a second data center, located in a paired region hundreds of miles away within the same geography. The data is chunked and distributed to nodes when a job is run. If you need to change the chunk size for memory-related performance at run time, that is still an option. You can pass in any Hadoop configuration parameter setting when you create the cluster, or you can use the SET command for a given job.
Isn't one of the selling points of Hadoop that the data sits with the compute? How does that work with WASB?
Just like with any Hadoop system, the data is loaded into memory on the individual nodes at compute time (when the job runs). The difference with WASB is that the data is loaded from the storage accounts instead of from local disks. Given the way Azure data center backbones are built, the performance is generally the same as or better than if you used disks locally attached to the VMs.
How do I load data to Hadoop on Azure?
You use any of the many Azure data loading methods. There isn't really anything special about loading data that will be used for Hadoop. As with data used by any other application, there are some guidelines around directory structures, optimal numbers of files, and internal format, but those are independent of data loading. Some common examples of loading tools are AzCopy, CloudXplorer and other storage explorers, and SQL Server Integration Services (SSIS).
And yes, I will blog about those guidelines but not here. :-)
Can I have multiple Hadoop clusters pointing to one storage account?
Can I have one Hadoop cluster pointing to multiple storage accounts?
Can I have many Hadoop clusters pointing to multiple storage accounts?
Why, yes. Yes you can.
Do I get to keep my data even if no Hadoop cluster currently exists?
What a fun day to say Yes.
For a caveat see HDInsight: Hive Internal and External Tables Intro.
Is WASB available for any distribution of Hadoop other than HDInsight?
It is my pleasure to answer that with a resounding Yes.
WASB is built into HDInsight (Microsoft's Hadoop on Azure service) and is the default file system. WASB is also available in the Apache source code for Hadoop. Therefore when you install Hadoop, such as Hortonworks HDP or Cloudera EDH/CDH, on Azure VMs you can use WASB with some configuration changes to the cluster.
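As a rough idea of what "some configuration changes" can look like, the hadoop-azure module reads the storage account key from a core-site.xml property named fs.azure.account.key.<account>.blob.core.windows.net. The Python sketch below only generates that property fragment; the account name and key are placeholders, and the remaining steps (such as putting the hadoop-azure jars on the classpath) vary by distribution and version, so treat this as a starting point rather than a complete recipe.

```python
import xml.etree.ElementTree as ET

def wasb_account_property(account, access_key):
    """Build the core-site.xml <property> element for one storage account."""
    prop = ET.Element("property")
    name = f"fs.azure.account.key.{account}.blob.core.windows.net"
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = access_key
    return prop

# "myaccount" and "PLACEHOLDER_KEY" are stand-ins, not real credentials.
prop = wasb_account_property("myaccount", "PLACEHOLDER_KEY")
fragment = ET.tostring(prop, encoding="unicode")
print(fragment)
```

One such property per storage account is also how a single cluster can point at several accounts.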
How do I manage files and directories?
Hive is the most common entry point for Hadoop jobs, and with Hive you never point to a single file; you always point to a directory. If you are a stickler for details and want to point out that Azure doesn't have directories, that's technically true. However, Hadoop recognizes that a slash "/" is an indication of a directory. Therefore Hadoop treats an Azure blob named SomeDirectory/ASubDirectory/AFile.txt as if it were AFile.txt in a directory structure of SomeDirectory/ASubDirectory. But since you don't access individual files in Hive, you will reference either SomeDirectory or SomeDirectory/ASubDirectory.
You can add, remove, and modify files in the Azure blob store without regard to whether a Hadoop cluster exists. Each time a job runs it reads the data that currently exists in the directory(s) it references. Hadoop itself can also write to files.
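The slash convention is easy to mimic. The Python sketch below treats a flat list of blob names as a directory tree and lists the immediate children of a "directory". It illustrates the idea only, not how WASB implements it; the function name and every blob name other than SomeDirectory/ASubDirectory/AFile.txt are invented for the example.

```python
def list_directory(blob_names, directory):
    """Return the immediate children of `directory` in a flat blob namespace.

    Subdirectories are returned with a trailing slash, files without one.
    """
    stripped = directory.strip("/")
    prefix = stripped + "/" if stripped else ""
    children = set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if not rest:
            continue  # the name is the directory marker itself
        head, sep, _ = rest.partition("/")
        children.add(head + "/" if sep else head)
    return sorted(children)

# Flat blob names; only the slashes suggest a hierarchy.
blobs = [
    "SomeDirectory/ASubDirectory/AFile.txt",
    "SomeDirectory/ASubDirectory/BFile.txt",  # hypothetical
    "SomeDirectory/Other.txt",                # hypothetical
]

print(list_directory(blobs, "SomeDirectory"))
print(list_directory(blobs, "SomeDirectory/ASubDirectory"))
```

A job that references SomeDirectory/ASubDirectory picks up whatever blobs share that prefix at the moment it runs, which is why adding or removing files between runs changes what the job sees.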
What about ORCFile, Parquet, and Avro?
They are specialized formats often used within Hadoop but rarely used outside of it. They are open Apache formats rather than proprietary ones, but few tools outside the Hadoop ecosystem read them. There are performance advantages to using those formats for "write once, read many" data inside Hadoop, but chances are high that you won't then be able to access the data without going through one of your Hadoop clusters.
Should I have lots of small files?
No. The full explanation of why is too long to give here. The short answer is to use files that are many multiples of the in-memory chunk size, in the GB or TB size range. Whenever possible, use fewer, larger files instead of many small files. If necessary, stitch the files together.
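A quick back-of-the-envelope calculation shows part of the reason. The Python sketch below assumes a simplified model in which every file produces at least one chunk (and therefore at least one task), and a 256 MB chunk size picked purely for illustration; actual behavior depends on the input format and settings you use.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def count_chunks(file_sizes, chunk_size):
    """Estimate how many chunks a set of files is split into.

    Simplified model: a file is never combined with another file,
    so each file contributes at least one chunk.
    """
    return sum(max(1, math.ceil(size / chunk_size)) for size in file_sizes)

chunk_size = 256 * MB            # assumed chunk size
small_files = [1 * MB] * 10_000  # roughly 10 GB spread over 10,000 files
large_files = [5 * GB] * 2       # 10 GB in two files

print(count_chunks(small_files, chunk_size))
print(count_chunks(large_files, chunk_size))
```

Under these assumptions, roughly the same volume of data turns into 10,000 tiny units of work in one case and 40 well-sized ones in the other, and each unit of work carries its own scheduling and startup overhead.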
That's your storage lesson for today - please put your additional Hadoop on Azure storage questions in the comments or send me a tweet! Thanks for stopping by!