Azure@home Part 3: Azure Storage

This post is part of a series diving into the implementation of the @home With Windows Azure project, which formed the basis of a webcast series by Developer Evangelists Brian Hitney and Jim O’Neil. Be sure to read the introductory post for the context of this and subsequent articles in the series.

In case you haven’t heard: through the end of October, Microsoft is offering 500 free monthly passes (in the US only) for you to put Windows Azure and SQL Azure through their paces.  No credit card is required, and you can apply on-line.  The pass appears to be equivalent to the two-week Azure tokens we provided to those who attended our live webcast series – only twice as long in duration!  Now you have no excuse not to give the @home with Windows Azure project a go!

In my last post, I started diving into the WebRole code of the Azure@home project, covering what the application does on startup, and I promised to dive deeper into the default.aspx and status.aspx implementations next.  Well, I’m going to renege on that!  As I was writing that next post, I found myself in a chicken-and-egg situation: I couldn’t really talk in depth about the implementation of those ASP.NET pages without talking about Azure storage first, so I’m inserting this post into the flow to introduce Azure storage.  If you’ve already had some experience with Azure and have set up and accessed an Azure storage account, much of this article may be old hat for you, but I wanted to make sure everyone has a firm foundation before moving on.

Azure Storage 101

Azure storage is one of the two main components of an Azure service account (the other being a hosted service, namely a collection of web and worker roles).  Each Azure project can support up to five storage accounts, and each account can accommodate up to 100 terabytes of data.  That data can be partitioned across three storage constructs:

  • blobs – for unstructured data of any type or size, with support for metadata,
  • queues – for guaranteed delivery of small (8K or less) messages, typically used for asynchronous, inter-role communication, and
  • tables – for structured, non-schematized, non-relational data.

Also available are Windows Azure drives, which provide mountable NTFS file volumes and are implemented on top of blob storage.

Azure@home uses only table storage; however, all of the storage options share some common attributes, chief among them a consistent RESTful API and a common authentication scheme based on an account name and key.

Development Storage

To support the development and testing of Windows Azure applications, the Windows Azure SDK includes a tool known as csrun, which provides a simulation of Azure compute and storage capabilities on your local machine – often referred to as the development fabric.  csrun is a stand-alone application, but that’s transparent when you’re running your application locally, since Visual Studio communicates directly with the fabric to support common developer tasks like debugging.  In terms of Azure storage, the development fabric simulates tables, blobs, and queues via a SQL Express database (by default; you can use the dsinit utility to point to a different instance of SQL Server on your local machine).

Development storage acts like a full-fledged storage account in an Azure data center and so requires the same type of authentication mechanism – namely an account name and a key – but those are fixed, well-known values:

Account name: devstoreaccount1
Account key: Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==

That’s probably not a value you’re going to commit to memory though, so as we’ll see in a subsequent post, there’s a way to configure a connection string (just as you might a database connection string) to easily reference your development storage account.  
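
For a quick preview (that later post will cover configuration in more detail), here’s a minimal sketch of referencing development storage with the StorageClient library; the class and variable names below are just illustrative.

    using System;
    using Microsoft.WindowsAzure;   // CloudStorageAccount (in the Microsoft.WindowsAzure.StorageClient assembly)

    class DevStorageSketch
    {
        static void Main()
        {
            // The SDK exposes a ready-made reference to development storage...
            CloudStorageAccount devAccount = CloudStorageAccount.DevelopmentStorageAccount;

            // ...which is equivalent to parsing the shortcut connection string
            // (in a real role this string would typically live in ServiceConfiguration.cscfg)
            CloudStorageAccount parsed = CloudStorageAccount.Parse("UseDevelopmentStorage=true");

            // Both resolve to the local devstoreaccount1 endpoints
            Console.WriteLine(devAccount.TableEndpoint);   // http://127.0.0.1:10002/devstoreaccount1
            Console.WriteLine(parsed.TableEndpoint);
        }
    }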

Keep in mind though that development storage is not implemented in exactly the same way as a true Azure storage account (for one thing, few of us have 100TB of disk space locally!), but for the majority of your needs it should suffice as you’re developing and testing your application.  Also note that you can run your application in the development fabric but access storage (blobs, queues, and tables, but not drives) in the cloud by simply specifying an account name and key for a bona fide Azure storage account.  You might do that as a second level of testing before incurring the expense and time to deploy your web and worker roles to the cloud.
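
When you do want your locally running roles to hit cloud storage, only the connection string changes.  Here’s a hedged sketch of the cloud connection string format; snowball and the key are placeholders for your own account name and primary access key.

    using Microsoft.WindowsAzure;

    class CloudConnectionSketch
    {
        static void Main()
        {
            // Placeholder values - substitute your own account name and primary access key
            string connectionString =
                "DefaultEndpointsProtocol=https;" +
                "AccountName=snowball;" +
                "AccountKey=<your-primary-access-key>";

            // TryParse fails gracefully on the placeholder above; with real values the
            // resulting account targets the cloud endpoints, and the rest of your
            // StorageClient code is unchanged from the development storage case.
            CloudStorageAccount account;
            if (CloudStorageAccount.TryParse(connectionString, out account))
            {
                // hand account.TableEndpoint / account.Credentials to your data context
            }
        }
    }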

Provisioning an Azure Storage Account

Sooner or later, you’re going to outgrow the local development storage and want to test your application in the actual cloud.  The first step toward doing so is creating a storage account within your Azure project.  You accomplish this via the Windows Azure Developer Portal as I’ve outlined below.  If you’re well-versed in these steps, feel free to skip ahead.

  1. Browse to https://windows.azure.com and sign in with your Azure Live ID account credentials.
  2. Select the project name to which you want to add a storage account.  Projects are defined when the account is provisioned via the Microsoft Online Services Customer Portal (MOCP) and so are in the purview of the account owner (the person getting the bill) versus the Azure services administrator, which is currently your role.
  3. Once you’ve selected the project, you’ll see a page listing the storage and hosted services you’ve provisioned so far for that project.  If this is a new project, there won’t be anything listed, but you’ll see a New Service link (actually two of them) to get you started; selecting one of those links leads you to the next step.
  4. You’re prompted for the type of service you’d like to set up – either a storage service or a hosted service.  Here we want a storage service; a hosted service comprises a collection of web and worker roles, and that’s what we’ll eventually deploy the Azure@home application itself to.
  5. Each service within a project is identified by a name and an optional description.  This information is not publicly exposed; it’s shown in the portal only for your own tracking purposes.

All access is via HTTP; therefore, a storage account is identified by a name used as the fifth-level domain name for the endpoint of the storage service.  For instance, a storage account with the name snowball would have an endpoint of snowball.table.core.windows.net for accessing Azure table storage.  Since the account name is part of a URI, that name must be unique across all of Windows Azure.
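
To make the naming pattern concrete, here’s a small sketch using the StorageClient API; the account name snowball and the key value are dummies, so substitute your own values from the portal.

    using System;
    using Microsoft.WindowsAzure;

    class EndpointSketch
    {
        static void Main()
        {
            // Dummy credentials - substitute your account name and primary access key (Base64)
            StorageCredentialsAccountAndKey credentials =
                new StorageCredentialsAccountAndKey("snowball", "cGxhY2Vob2xkZXJrZXk=");

            CloudStorageAccount account = new CloudStorageAccount(credentials, true /* use HTTPS */);

            // Each storage service gets its own endpoint built from the account name
            Console.WriteLine(account.TableEndpoint);   // https://snowball.table.core.windows.net/
            Console.WriteLine(account.BlobEndpoint);    // https://snowball.blob.core.windows.net/
            Console.WriteLine(account.QueueEndpoint);   // https://snowball.queue.core.windows.net/
        }
    }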

In addition to providing the unique account name, you must also specify where you want your data hosted.  Currently, Azure is hosted in six data centers across the world, divided into three regions (Asia, Europe, and the US).  You can select either a region or a sub-region to house your data; if you select a region, your storage account will be assigned automatically to one (and only one) sub-region.

Select your location wisely.  Typically you want to have your data and compute resources (web and worker roles) collocated to reduce latency and data transfer costs.  You also want to locate your resources as closely as possible to the majority of your users, again to reduce latency.  The concept of affinity groups can help keep your various services collocated by attaching a meaningful name (like Azure@home) to a region or sub-region and then specifying that affinity group name for additional services that require collocation (versus trying to remember what specific region or sub-region everything was deployed to).

At this point all of the required information has been provided, and you can create the new storage service.

Once the service has been created, a project summary page with five sections is displayed.  Those sections include:

  1. Description – the text you entered on the Service Properties screen a few steps prior
  2. Cloud Storage –
    • URL endpoints for your storage; note all start with the storage account name, but tables, blobs, and queues each have a distinct URL.
    • Primary Access Key – a Base64-encoded key used to ‘sign’ all storage requests
    • Secondary Access Key – a second key that can be employed when you need to retire the primary access key, but do not want to interrupt data access.
  3. Affinity Group – location of the storage service
  4. Content Delivery Network – whether or not the CDN option for blob storage has been enabled.
  5. Custom Domains – optional mapping of the Azure blob endpoint to a domain containing your own company’s URL.  For example, you can set up gallery.contoso.com to point to contoso.blob.core.windows.net and publicize your own organization’s URL versus the more generic Azure one. A Windows Azure team’s blog post goes into more detail on how to accomplish this.  Note, we won’t be using custom domains, CDN, or even blobs for that matter in Azure@home.

Table Storage in Azure@home

With your storage account provisioned, you’re now equipped to create and manipulate your data in the cloud, and for Azure@home that specifically means two tables: client and workunit.

Data is inserted into the client table by the WebRole, and that data consists of a single row containing the name, team number, and location (lat/long) of the ‘folder’ running the Folding@home client application.  We’ll see in a subsequent post that multiple worker roles poll this table, waiting for a row to arrive so they can pass that same data on to the Folding@home console application provided by Stanford.

WebRole also reads from a second table, workunit, which contains rows reflecting both the progress of in-process work units and the statistics for completed work units.   The progress data is added and updated in that table by the various worker roles.

Reiterating what I mentioned above, Azure tables provide structured, non-relational, non-schematized storage (essentially a NoSQL offering).  Let’s break down what that means:

  • structured – Every row (or more correctly, entity) in a table has a defined structure and is atomic.  An entity has properties (‘columns’), and those properties have specific data types like string, double, int, etc.  Contrast that with blob storage, where each blob is just a series of bytes with no defined structure.  Every entity in an Azure table also includes three properties:
    • PartitionKey – a string value; the ‘partition’ has great significance in terms of performance and scalability:
      • the table clusters data within the same partition, which enhances locality of reference; conversely, different partitions can be distributed across multiple storage nodes
      • each partition is served by a single storage node (which is beneficial for intra-partition queries but *can* present a bottleneck for updates)
      • transactional capabilities (Entity Group Transactions) apply only within a partition
    • RowKey – a string value that, along with the PartitionKey, forms the unique (and currently the only) index for the table
    • Timestamp – a read-only DateTime value assigned by the server marking the time the entity was last modified

Recommendation: The selection of PartitionKey and RowKey is perhaps the most important decision you’ll make when designing an Azure table.  For a good discussion of the considerations involved, be sure to read the Windows Azure Table whitepaper by Jai Haridas, Niranjan Nilakantan, and Brad Calder.  There are complementary whitepapers for blob, queue, and drive storage as well.

  • non-relational – This is not SQL Server or any of the other DBMSes you’ve grown to love.  There are no joins, and there is no SQL; for that you can use SQL Azure.  But before you jump ship back to the relational world, keep in mind that the largest database supported by SQL Azure is 50GB, while a single Windows Azure storage account supports 100TB of table storage (the equivalent of 2,000 maxed-out SQL Azure databases) – and even then you can provision multiple 100TB accounts!  Relational capability comes at a price, and that price is scalability and performance.  Why, you ask?  Check out the references on the CAP Theorem.
  • non-schematized – Each entity in a table has a well-defined structure; however, that structure can vary among the entities in the table.  For example, you could store employees and products in the same table: employees might have properties (columns) like last_name and hire_date, while products have properties like description and inventory.  The only properties that have to be shared by all entities are the three mentioned above (PartitionKey, RowKey, and Timestamp).  That said, in practice it’s fairly common to impose a fixed schema on a given table, and when using abstractions like ORMs and the StorageClient API, it’s practically a necessity (see the short sketch following this list).
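
To make the ‘non-schematized’ point concrete, here’s a minimal sketch; the Employee and Product classes (and the hypothetical mixed table) are illustrative only and are not part of Azure@home.

    using System;
    using Microsoft.WindowsAzure.StorageClient;

    // Two differently shaped entities; only PartitionKey, RowKey, and Timestamp
    // (inherited from TableServiceEntity) are common to both.
    public class Employee : TableServiceEntity
    {
        public String LastName { get; set; }
        public DateTime HireDate { get; set; }
    }

    public class Product : TableServiceEntity
    {
        public String Description { get; set; }
        public Int32 Inventory { get; set; }
    }

    // Both types could be written to the *same* table (say, "mixed") via a
    // TableServiceContext - e.g., context.AddObject("mixed", someEmployee) and
    // context.AddObject("mixed", someProduct) - because the table itself imposes no schema.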

Let’s take a look at the schema we defined for the two tables supporting the Azure@home project:

client table

  • PartitionKey (String) – Azure table partition key
  • RowKey (String) – Azure table row key
  • Timestamp (DateTime) – read-only timestamp of the last modification of the entity (row)
  • Latitude (Double) – user’s latitude entered via the default.aspx page
  • Longitude (Double) – user’s longitude entered via the default.aspx page
  • PassKey (String) – optional passkey (only assignable via code)
  • ServerName (String) – server name (used to distinguish results reported to distributed.cloudapp.net)
  • Team (String) – Folding@home team id (the default of 184157 comes from default.aspx but can be overridden in code)
  • UserName (String) – Folding@home user name entered via the default.aspx page

The PartitionKey defined for this table is the UserName, and the RowKey is the PassKey.  Was that the best choice?  Well, in this case the point is moot, since there will be at most one record in this table.

workunit table

  • PartitionKey (String) – Azure table partition key
  • RowKey (String) – Azure table row key
  • Timestamp (DateTime) – read-only timestamp of the last modification of the entity (row)
  • CompleteTime (DateTime) – time (UTC) that the assigned WorkerRole finished processing the work unit (a value of null indicates the work unit is still in progress)
  • DownloadTime (String) – download time value for the work unit (assigned by Stanford)
  • InstanceId (String) – ID of the WorkerRole instance processing the given work unit (RoleEnvironment.CurrentRoleInstance.Id)
  • Name (String) – Folding@home work unit name (assigned by Stanford)
  • Progress (Int32) – work unit percent complete (0 to 100)
  • StartTime (DateTime) – time (UTC) that the assigned WorkerRole started processing the work unit
  • Tag (String) – Folding@home tag string (assigned by Stanford)

The PartitionKey defined for this table is the InstanceId, and the RowKey is a concatenation of the Name, Tag, and DownloadTime fields.  Why those choices? 

  • The InstanceId defines the WorkerRole currently processing the work unit.  At any given time there is one active role instance with that ID, and so it’s the only one that’s going to write to that partition in the workunit table, avoiding any possibility of contention.  Other concurrently executing worker roles will also be writing to that table, but they will have different InstanceIds and thus target different storage nodes.  Over time each partition will grow since each WorkerRole retains its InstanceId after completing a work unit and moving on to another, but at any given time the partition defined by that InstanceId has a single writer.
  • Remember that the PartitionKey + RowKey combination defines the unique index for the entity.  When a work unit is downloaded from Stanford’s servers (by the WorkerRole instances), it’s identified primarily by a name and a tag (e.g., “TR462_B_1 in water” is the name of a work unit, and “P6503R0C68F89” is a tag).  Empirically we discovered that the same work unit may be handed out to more than one worker role, so for our purposes the Name+Tag combination was not sufficient.  Adding the DownloadTime (also provided by the Stanford application) gave us the uniqueness we required; a point lookup that exploits this index is sketched just below.
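
Since the PartitionKey/RowKey pair is the table’s only index, the most efficient retrieval is a point query that supplies both values.  Here’s a hypothetical sketch – the literal values are placeholders, and the WorkUnit and ClientDataContext types are the ones shown later in this post.

    using System.Linq;
    using AzureAtHomeEntities;   // WorkUnit and ClientDataContext (shown later in this post)

    static class WorkUnitLookup
    {
        // Hypothetical point lookup; the literal values below are illustrative placeholders.
        public static WorkUnit Find(ClientDataContext ctx)
        {
            string instanceId = "WorkerRole_IN_0";                  // PartitionKey (role instance ID)
            string name = "TR462_B_1 in water";                     // work unit name (example above)
            string tag = "P6503R0C68F89";                           // work unit tag (example above)
            string downloadTime = "May 28 09:32:58";                // placeholder download time
            string rowKey = name + "|" + tag + "|" + downloadTime;  // same scheme as WorkUnit.MakeKey

            return ctx.WorkUnits
                      .Where(w => w.PartitionKey == instanceId && w.RowKey == rowKey)
                      .AsEnumerable()       // issues the HTTP request to table storage
                      .FirstOrDefault();    // at most one entity can match the full key
        }
    }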

Programmatic Access

When I introduced Azure storage above, I mentioned how all the storage options share a consistent RESTful API.  REST stands for Representational State Transfer, a term coined by Dr. Roy Fielding in his Ph.D. dissertation back in 2000.  In a nutshell, REST is an architectural style that exploits the natural interfaces of the web (including a uniform API, resource-based access, and the use of hypermedia to communicate state).  In practice, RESTful interfaces on the web

  • treat URIs as distinct resources (versus operations as is typical in SOAP-based APIs),
  • use more of the spectrum of available HTTP verbs (specifically, but not exclusively, GET for retrieving a resource, PUT for updating a resource, POST for adding a resource, and DELETE for removing a resource), and
  • embed hyperlinks in resource representations to navigate to other related resources, such as in a parent-child relationship.

OData

The RESTful architecture employed by Azure table storage specifically subscribes to the Open Data Protocol (or OData), an open specification for data transfer on the web.  OData is a formalization of the protocol used by WCF Data Services (née ADO.NET Data Services, née “Astoria”).

The obvious benefit of OData is that it makes Azure table storage accessible to any client, on any platform, in any language that supports an HTTP stack and the Atom syndication format (an XML specialization) – PHP, Ruby, curl, Java, you name it.  At the lowest level, every request to Azure storage – retrieving data, updating a value, creating a table – occurs via an HTTP request/response cycle (“the uniform interface” in this RESTful implementation).
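
To give a feel for what’s on the wire, here’s a hedged sketch that simply builds (and prints) the kind of OData query URI the Table service accepts.  The snowball account name and the filter expression are illustrative, and a real request would also carry x-ms-date/x-ms-version headers plus a SharedKey(Lite) Authorization signature – details the StorageClient library normally handles for you.

    using System;

    class ODataUriSketch
    {
        static void Main()
        {
            // Illustrative only: account name, table name, and filter are placeholders
            string account = "snowball";
            string table = "workunit";

            // OData query options ($filter, $top, etc.) ride on the query string;
            // the response comes back as an Atom feed of entities
            string uri = String.Format(
                "https://{0}.table.core.windows.net/{1}()?$filter={2}&$top=10",
                account,
                table,
                Uri.EscapeDataString("Progress lt 100"));

            Console.WriteLine(uri);
            // https://snowball.table.core.windows.net/workunit()?$filter=Progress%20lt%20100&$top=10
        }
    }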

StorageClient

While it’s great to have such an open and common interface, as developers our heads would quickly explode if we had to craft HTTP requests and parse HTTP responses for every data access operation (just as they would if we had to code to the core ODBC API or parse TDS for SQL Server).  The abstraction of the RESTful interface that we crave is provided in the form of the StorageClient API for .NET, and analogous libraries are available for PHP, Ruby, Java, and other languages.  StorageClient provides a LINQ-enabled model, with client-side change tracking, that abstracts away the underlying HTTP implementation.  If you’ve worked with LINQ to SQL or the ADO.NET Entity Framework, the programming model will look familiar.

To handle the object-“relational” mapping in Windows Azure there’s a bit of a manual process required to define your entities, which makes sense since we aren’t dealing with nicely schematized tables as with LINQ to SQL or the Entity Framework.  In Azure@home that mapping is incorporated in the AzureAtHomeEntities project/namespace, which consists of a single code file: AzureAtHomeEntities.cs.  There’s a good bit of code in that file, but it divides nicely into three sections:

  • a ClientInformation entity definition,
  • a WorkUnit entity definition, and
  • a ClientDataContext.

ClientInformation entity definition

It should be fairly obvious from comparing the code below to the client table schema that the ClientInformation class maps directly to that table.  Line 12 begins the constructor for a new entity (which we’ll eventually see being used in the WebRole code).  In Lines 22 and 23, you’ll note the assignment of the PartitionKey and RowKey fields.  Those fields (along with the read-only Timestamp field) are defined in the base class TableServiceEntity.

    1:      public class ClientInformation : TableServiceEntity
    2:      {
    3:          public String UserName { get; set; }
    4:          public String Team { get; set; }
    5:          public String ServerName { get; set; }
    6:          public Double Latitude { get; set; }
    7:          public Double Longitude { get; set; }
    8:          public String PassKey { get; set; }
    9:   
   10:          public ClientInformation() { }
   11:          
   12:          public ClientInformation(String userName, String passKey, String teamName,
   13:              Double latitude, Double longitude, String serverName)
   14:          {
   15:              this.UserName = userName;
   16:              this.PassKey = passKey;
   17:              this.Team = teamName;
   18:              this.Latitude = latitude;
   19:              this.Longitude = longitude;
   20:              this.ServerName = serverName;
   21:   
   22:              this.PartitionKey = this.UserName;
   23:              this.RowKey = this.PassKey;
   24:          }
   25:      }

WorkUnit entity definition

This entity of course maps to the workunit table, and is only slightly more complex than the ClientInformation entity above, by virtue of the concatenation required to form the RowKey (Lines 26-29).

    1:      public class WorkUnit : TableServiceEntity
    2:      {
    3:          public String Name { get; set; }
    4:          public String Tag { get; set; }
    5:          public String InstanceId { get; set; }
    6:          public Int32 Progress { get; set; }
    7:          public String DownloadTime { get; set; }
    8:          public DateTime StartTime { get; set; }
    9:          public DateTime? CompleteTime { get; set; }
   10:   
   11:          public WorkUnit() { }
   12:          public WorkUnit(String name, String tag, String downloadTime, String instanceId)
   13:          {
   14:              this.Name = name;
   15:              this.Tag = tag;
   16:              this.Progress = 0;
   17:              this.InstanceId = instanceId;
   18:              this.StartTime = DateTime.UtcNow;
   19:              this.CompleteTime = null;
   20:              this.DownloadTime = downloadTime;
   21:   
   22:              this.PartitionKey = this.InstanceId;
   23:              this.RowKey = MakeKey(this.Name, this.Tag, this.DownloadTime);
   24:          }
   25:   
   26:          public String MakeKey(String n, String t, String d)
   27:          {
   28:              return n + "|" + t + "|" + d;
   29:          }
   30:      }

ClientDataContext

The ClientDataContext is the wrapper class for all of the data access, just as the DataContext is the entry class to LINQ to SQL and the ObjectContext is for the Entity Framework.  Here, the base class is TableServiceContext (which in turn extends DataServiceContext).  It’s via the TableServiceContext instance that you make the connection to Azure table storage (specifying endpoint and credentials) and enumerate entities in table storage.  Under the covers, the context manages object tracking on the client and handles the translation of CRUD (Create-Read-Update-Delete) operations to the underlying HTTP requests that make up the Azure Table Service REST API.

In this somewhat simplistic implementation, two queries exist on the underlying context, each returning the complete contents of one of the tables (client or workunit); however, we could have included additional IQueryable<T> properties to provide options to return subsets of the data as well.  Making these properties IQueryable enables us to compose additional query semantics in the WebRole and WorkerRole code.

    1:      public class ClientDataContext : TableServiceContext
    2:      {
    3:   
    4:          public ClientDataContext(String baseAddress, StorageCredentials credentials)
    5:              : base(baseAddress, credentials)
    6:          {
    7:          }
    8:   
    9:          public IQueryable<ClientInformation> Clients
   10:          {
   11:              get
   12:              {
   13:                   return this.CreateQuery<ClientInformation>("client");
   14:              }
   15:          }
   16:   
   17:          public IQueryable<WorkUnit> WorkUnits
   18:          {
   19:              get
   20:              {
   21:                  return this.CreateQuery<WorkUnit>("workunit");
   22:              }
   23:          }
   24:      }

At this point we’ve got a lot of great scaffolding set up, but we haven’t actually created our tables yet, much less populated them with data!  In the next post, I’ll revisit the WebRole implementation to show how to put the StorageClient API and the entities described above to use.