Azure@home Part 6: Synchronous Table Storage Pagination

This post is part of a series diving into the implementation of the @home With Windows Azure project, which formed the basis of a webcast series by Developer Evangelists Brian Hitney and Jim O’Neil. Be sure to read the introductory post for the context of this and subsequent articles in the series.

So where were we before vacations and back-to-school preparations got in my way!? Part 5 of this continuing series talked about the underlying REST protocol that’s used by the Azure Storage API, and in that discussion, we touched on two query classes:

  • DataServiceQuery, which will return at most 1000 entities, along with continuation tokens you can manage yourself to retrieve the additional entities fulfilling a given query, and
  • CloudTableQuery, which includes an Execute method you call once to return all of the data with no further client calls necessary.  Essentially, it handles the continuation tokens for you and makes multiple HTTP requests (each returning 1000 entities or less) as you materialize the query.  Recall, you can use the AsTableServiceQuery extension method on a DataServiceQuery to turn it into a CloudTableQuery and access this additional functionality.

In the status.aspx page for Azure@home there are two collection displays – a Repeater control (InProgress) showing the work units in progress and a GridView (GridViewCompleted) displaying the completed work units.  The original code we’ve been looking at has the following line to retrieve all of the work units:

var workUnitList = ctx.WorkUnits.ToList<WorkUnit>();

We know now that code will result in at most 1000 entities being returned.  Presuming Azure@home has been chugging away for a while, there may be more than 1000 completed work units, and as implemented now, we’ll never get them all.  Additionally, since the table data is sorted by PartitionKey, and the partition key for the workunit table is the Azure WorkerRole instance ID, the entities you do get may change over time – it’s not as if you’re guaranteed to get the first 1000 work units completed or the last 1000.

It’s simple enough to replace the line above with

var workUnitList = ctx.WorkUnits.AsTableServiceQuery().ToList<WorkUnit>();

and all of the data will be returned, all 10 or all 10,000 entities – whatever happens to be in the table at the time.  Obviously we need a middle-ground here: control over the pagination without bringing down massive amounts of data that the user may never look at. 

While it’s conceivable you could have thousands of in-progress work units (each running in a worker role), that’s costly and beyond what you’ll be able to deploy as part of a typical Windows Azure account (20 roles is the default limit).  To save some time and complexity then, I’m not going to worry about paginating the InProgress Repeater control. 

You could, though, certainly accumulate a lot of completed work units, especially if you are leveraging an offer such as the Windows Azure One Month Pass.  So for purposes of this discussion, the focus will be on a pagination scheme for the GridView displaying those completed work units.

As you might have expected from the existence of two similar classes (DataServiceQuery and CloudTableQuery), there are actually two mechanisms you can use to implement the pagination, one synchronous (via DataServiceQuery) and the other asynchronous (via CloudTableQuery).  This post will focus on the former, and in the next post, we’ll transform that into an asynchronous implementation.

Some Refactoring

It was my goal to minimally disrupt the other code in Azure@home and confine modifications solely to status.aspx.  To accomplish that I had to do a bit of refactoring and introduce a utility class or two.  The completed implementation of the changes to status.aspx (and the code-behind files) is attached to this blog post, so you should be able to replace the original implementation with this code and give it a whirl.

PageData class

When you’re implementing a paging scheme with Azure table storage, you’ve got a couple of choices in terms of how to handle repeat requests for the same page.

  • Cache the data once it’s retrieved.   This approach minimizes the number of requests back to the service providing the data, but it also can make things more complex.  You have to maintain the cache, and if the data is volatile, refresh the cache periodically so the data is not stale.
  • Re-request the data for each pagination request.  This approach requires maintaining a list of continuation tokens that correspond to each page of results – kind of like guide words in a dictionary (the printed kind). Scott Densmore uses this approach in his blog post.

Since Scott’s example covers one of the options, I thought I’d go for the other, and just use the session state as my cache.  The completed data is never going to change, so why pay for an additional Azure storage transaction to get the same data?  Granted, I’m carrying around some baggage on the Web server now, and in a perfect world, I might use something like Velocity, but let’s leave that for a different day.

To encapsulate the data I need to store in session state, I created a class called PageData:

        protected class PageData
        {
            public List<WorkUnit> InProgressList = new List<WorkUnit>();
            public List<CompletedWorkUnit> CompletedList = new List<CompletedWorkUnit>();
            public String PartitionKey = null;
            public String RowKey = null;
            public Boolean QueryResultsComplete = false;
        }

maintaining the following information:

  • InProgressList – a list of work units being processed now by worker role instances.  WorkUnit is defined as a TableServiceEntity in AzureAtHomeEntities.cs.
  • CompletedList – a list of work units already processed, but including only those that the user has paged through so far.  CompletedWorkUnit is a new class defined in status.aspx and described below.
  • PartitionKey and RowKey – continuation tokens used in pagination.
  • QueryResultsComplete – a boolean indicating all results have been downloaded, that is, that the user has explicitly paged through all of the data.

CompletedWorkUnit class

Notice above that CompletedList is a collection of a new class, CompletedWorkUnit.   Why a new class? The original implementation materialized the entire workunit table into a workUnitList enumerable, and then used LINQ to Objects expressions to project the data source for both the Repeater and the GridView, as below:

 GridViewCompleted.DataSource =
     (from w in workUnitList
      where w.CompleteTime != null
      let duration = (w.CompleteTime.Value.ToUniversalTime() -
                      w.StartTime.ToUniversalTime())
      orderby w.StartTime descending
      select new
      {
          // ... other bound fields elided in this excerpt ...
          Duration = String.Format("{0:#0} h {1:00} m",
             duration.TotalHours, duration.Minutes)
      });

Then the magic of data binding matched each property of the anonymous class in the LINQ projection to a BoundField in the GridView by name.  That worked because all of the data was pulled down into memory and then manipulated with LINQ to Objects.

Now that we’re trying to query on an as-needed basis, we’re shifting from LINQ to Objects to the ADO.NET Data Services Client Library, and there are a couple of catches when addressing Azure table storage:

  1. You can’t include a null in the query expression (so the where clause in the LINQ expression above is invalid), and
  2. You can’t do a projection; you can only return the complete entity.

Didn’t Microsoft rename ADO.NET Data Services to WCF Data Services?  So shouldn’t that be the WCF Data Services Client Library, not the ADO.NET Data Services Client Library?

The name change officially applies to the .NET Framework 4 version of the protocol, and since Azure@home was developed under .NET 3.5, we’re using the older terminology here.  The constraints mentioned here also apply to .NET Framework 4 and the Open Data Protocol.

It’s the second issue above that necessitates a new class, since the Duration property is computed and not actually present in the workunit table.  What we’ll see later on is that after the query results are retrieved, the WorkUnit instances are reshaped into CompletedWorkUnits (below), which reintroduces a Duration property and enables the data binding to GridViewCompleted to succeed.

 protected class CompletedWorkUnit
 {
     public String Name { get; set; }
     public String Tag { get; set; }
     public DateTime StartTime { get; set; }
     public DateTime? CompleteTime { get; set; }

     public CompletedWorkUnit(WorkUnit wu)
     {
         this.Name = wu.Name;
         this.Tag = wu.Tag;
         this.StartTime = wu.StartTime;
         this.CompleteTime = wu.CompleteTime;
     }

     public String Duration
     {
         get
         {
             if (this.CompleteTime == null)
                 return "";

             TimeSpan duration = this.CompleteTime.Value.ToUniversalTime()
                                 - this.StartTime.ToUniversalTime();
             // Math.Floor: "{0:#0}" rounds, so 1.75 hours would print as "2 h 45 m"
             return String.Format("{0:#0} h {1:00} m",
                             Math.Floor(duration.TotalHours), duration.Minutes);
         }
     }
 }


Embracing the ASP.NET Page Lifecycle

The original attempt at status.aspx was pretty straightforward, and everything was in Page_Load, but now we need to be a bit more cognizant of when things happen in processing the page:

  • Pagination has been added to the GridView, so we’ve got a PageIndexChanging event to consider.
  • We’ve got session state to maintain.
  • The Refresh button now requires logic to clear the session state and reinitiate the retrieval from Azure table storage.

To accommodate the changes, the implementation of Page_Load has been broken up across a few methods and events in the ASP.NET page life cycle, as depicted below.

[Diagram: ASP.NET Page Lifecycle]

The stages of the lifecycle relevant to this discussion are highlighted in blue; the two green boxes represent data retrieval logic refactored from the previous version of Page_Load; and the red box represents the session state, namely an instance of the PageData class that is initialized when the page is loaded and saved back to the session when the page is unloaded.

Let’s next walk through the code implementing each stage of the lifecycle.


    1:  protected void Page_Load(object sender, EventArgs e)
    2:  {
    3:      if (Session["PageData"] != null)
    4:          pageData = Session["PageData"] as PageData;
    5:      else
    6:          pageData = new PageData();
    8:      var cloudStorageAccount =
    9:          CloudStorageAccount.FromConfigurationSetting("DataConnectionString");
   10:      var cloudClient = new CloudTableClient(
   11:          cloudStorageAccount.TableEndpoint.ToString(),
   12:          cloudStorageAccount.Credentials);
   14:      // get client info
   15:      ctx = new ClientDataContext(
   16:              cloudStorageAccount.TableEndpoint.ToString(),
   17:              cloudStorageAccount.Credentials);
   19:      if (!this.IsPostBack)
   20:      {
   21:          // get name of the user (maintained in ViewState)
   22:          var clientInfo = ctx.Clients.FirstOrDefault();
   23:          litName.Text = clientInfo != null ?
   24:                  clientInfo.UserName : "Hello, unidentifiable user";
   26:          // ensure workunit table exists and retrieve data from Table Storage
   27:          cloudClient.CreateTableIfNotExist("workunit");
   28:          if (cloudClient.DoesTableExist("workunit"))
   29:          {
   30:              RetrieveInProgessUnits();
   31:              RetrieveCompletedUnits();
   32:          }
   33:          else
   34:          {
   35:              System.Diagnostics.Trace.TraceError(
                            "Unable to create 'workunit' table in Azure storage");
   36:          }
   37:      }
   38:  }

The retrieval of the session state occurs in Lines 3-6, followed by setting up the cloud storage account and context (Lines 8-17) just as in the original implementation.  

New is the check for a postback (Line 19).  Only on the first rendering of the page will you retrieve the client name (from the client table), check that the workunit table exists, and populate the Repeater and GridView with some initial data.  Subsequent interaction occurs via the pagination of the GridView or pressing the Refresh button, both of which incur postbacks and are handled in other code on the page.

What about…  clientInfo on Line 22?  It’s part of the !this.IsPostBack branch, so don’t we lose that data when paging through results?

Yes and no.  clientInfo is lost on a postback, but we only needed to grab the UserName field once.  It’s assigned to litName on the first page view, and since litName has its EnableViewState property set to true, the client name is retained across pagination and page refreshes.  litName, by the way, is the only control on this page with ViewState enabled.

Clearly the two methods RetrieveInProgressUnits (Line 30) and RetrieveCompletedUnits (Line 31) are the most significant ones, and they’re up next!


The original implementation of the in-progress retrieval in status.aspx is below:

 InProgress.DataSource =
     (from w in workUnitList
      where w.CompleteTime == null
      let duration = (DateTime.UtcNow - w.StartTime.ToUniversalTime())
      orderby w.InstanceId ascending
      select new
      {
          // ... other bound fields elided in this excerpt ...
          Duration = String.Format("{0:#0} h {1:00} m",
             duration.TotalHours, duration.Minutes)
      });

and it’s been simplified to:

 private void RetrieveInProgessUnits()
 {
     pageData.InProgressList =
             (from w in ctx.WorkUnits
              where w.Progress < 100
              select w).ToList<WorkUnit>();
 }

which incorporates three significant changes:

  • The projection has been removed, since projections are not supported in the ADO.NET Data Services Client against Azure table storage.  In fact, the projection isn’t even needed for the original implementation: the Duration property isn’t part of the user interface (a fact I didn’t realize until writing this blog post!)
  • The w.CompleteTime == null test isn’t supported in Azure table storage either, so in this case I opted for a nearly synonymous test of the Progress field.
  • Instead of assigning the result to the Repeater’s DataSource property directly, the data is retained in a session variable, pageData, with binding to occur later (in PreRenderComplete).  We won’t be implementing pagination on this control, but I am still caching the data so that a subsequent postback doesn’t automatically re-retrieve the data.  That means that as a user pages through the GridView, the in-progress work unit data shown in the Repeater will not be refreshed.  You may or may not agree with that implementation, but it’s certainly easy enough to modify to your tastes.  My take is that if a user is paging through, he’s focused on the completed data, not the in-progress data, and there is always the handy “Refresh this Page” button for him to use to get the latest and greatest in-progress (and completed) data.


Ok, RetrieveCompletedUnits is the meat of this article.  The original code looked something like the InProgress.DataSource assignment above, where workUnitList was completely materialized on the client before the LINQ (to Objects) expression was constructed – I hereby officially deem that ‘old school’!  Now the plan is to query the Azure table service to show the completed workunits on demand, a page at a time, where the size of the page is determined by the PageSize of the GridView.

Here’s how it works.  Say there are 1200 entities in the Azure table, and the GridView is configured with a PageSize of 100.  The first time the page is displayed, you’ll see 100 entities, and the pager control will indicate that a second page is available, but no more.  Behind the scenes a query is made for 101 entities.  Why 101?   100 to fill the page, and then one more to force the Page 2 indicator to appear in the pager control.  

Continuation tokens are saved from the response header of that first retrieval, so when the second page is requested via the pager control, another query can be made for just the next 100 entities, and so on until the user pages through all of the results.   If she doesn’t page through all the results, there’s no wasted bandwidth, and if she backs up to previous pages, they’re served immediately from the cache in the session state.  Every entity in the table is retrieved once - and only once.
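To make the bookkeeping concrete, here is a minimal in-memory sketch of the scheme just described.  It does not touch Azure at all: FakeTableService, PagedCache and the integer "token" are hypothetical stand-ins for the table service, the PageData session object and the real continuation headers, and the numbers mirror the 1200-entity, 100-per-page example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the table service (not the Azure API).
public class FakeTableService
{
    private readonly List<int> rows;
    public int RowsServed { get; private set; }

    public FakeTableService(int rowCount)
    {
        rows = Enumerable.Range(0, rowCount).ToList();
    }

    // Returns up to 'top' rows starting at 'token' (null means "from the start");
    // nextToken is null once the table is exhausted.
    public List<int> Query(int top, int? token, out int? nextToken)
    {
        int start = token ?? 0;
        var page = rows.Skip(start).Take(top).ToList();
        int end = start + page.Count;
        nextToken = end < rows.Count ? end : (int?)null;
        RowsServed += page.Count;
        return page;
    }
}

// Plays the role PageData plays in the post: cached rows plus the saved token.
public class PagedCache
{
    private readonly FakeTableService service;
    private readonly int pageSize;
    private int? token;

    public List<int> Cached { get; private set; }
    public bool Complete { get; private set; }

    public PagedCache(FakeTableService service, int pageSize)
    {
        this.service = service;
        this.pageSize = pageSize;
        Cached = new List<int>();
    }

    // Number of pages a pager would show for the rows cached so far.
    public int PageCount
    {
        get { return (Cached.Count + pageSize - 1) / pageSize; }
    }

    // Fetch one more chunk: pageSize + 1 on the first call, pageSize afterwards.
    public void RetrieveNext()
    {
        if (Complete) return;
        int top = token == null ? pageSize + 1 : pageSize;
        int? next;
        Cached.AddRange(service.Query(top, token, out next));
        token = next;
        Complete = next == null;
    }

    // Fetch only when the last known page is requested; otherwise serve from cache.
    public List<int> GetPage(int pageIndex)
    {
        if (pageIndex == PageCount - 1) RetrieveNext();
        return Cached.Skip(pageIndex * pageSize).Take(pageSize).ToList();
    }
}
```

Calling RetrieveNext once (the first page view) and then GetPage for each page the user clicks pulls one chunk per new page, while requests for earlier pages never reach the fake service.  The extra row on the first fetch is what keeps a "next page" link alive in the pager.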

Here’s the code:

    1:  protected void RetrieveCompletedUnits()
    2:  {
    3:      if (!pageData.QueryResultsComplete)
    4:      {
    5:          // select enough rows to fill a page of the GridView
    6:          // GridView.PageSize < 1000 or UI paradigm will fail
    7:          Int32 maxRows = GridViewCompleted.PageSize;
    9:          // add one if first page, to force a page 2 indicator
   10:          if (pageData.PartitionKey == null)
   11:              maxRows++;
   13:          // set up query
   14:          var qry = (from w in ctx.WorkUnits
   15:                     where w.Progress == 100
   16:                     select w).Take(maxRows) as DataServiceQuery<WorkUnit>;
   18:          // add continuation token (if there is one) to query
   19:          if (pageData.PartitionKey != null)
   20:          {
   21:              qry = qry.AddQueryOption("NextPartitionKey", 
                        pageData.PartitionKey);
   22:              if (pageData.RowKey != null)
   23:              {
   24:                  qry = qry.AddQueryOption("NextRowKey", pageData.RowKey);
   25:              }
   26:          }
   28:          // execute the query
   29:          var response = qry.Execute() as QueryOperationResponse<WorkUnit>;
   31:          // grab continuation token from response
   32:          if (response.Headers.ContainsKey(
                    "x-ms-continuation-NextPartitionKey"))
   33:          {
   34:              pageData.PartitionKey = 
                        response.Headers["x-ms-continuation-NextPartitionKey"];
   35:              if (response.Headers.ContainsKey(
                        "x-ms-continuation-NextRowKey"))
   36:              {
   37:                  pageData.RowKey = 
                            response.Headers["x-ms-continuation-NextRowKey"];
   38:              }
   39:          }
   41:          // if no continuation token, reached end of table
   42:          else
   43:          {
   44:              pageData.PartitionKey = null;
   45:              pageData.RowKey = null;
   46:              pageData.QueryResultsComplete = true;
   47:          }
   49:          // add newly retrieved data to the current collection
   50:          pageData.CompletedList.AddRange(
   51:              from wu in response.AsEnumerable<WorkUnit>()
   52:                  select new CompletedWorkUnit(wu)
   53:          );
   54:      }
   55:  }

Lines 5-11 set up how many entities (rows) to return, and Lines 14-16 set up (but do not execute) the query.   Prior to that initial retrieval, there are no continuation tokens set (PartitionKey == null), so the condition in Line 19 fails, and the query is actually executed in Line 29.

As you might recall from my last post, the Take extension method (Line 16) adds a $top query option to the resulting HTTP request, so the query options on the GET URL that goes out look like:

     $filter=Progress%20eq%20100&$top=26

$filter comes from the where clause in Line 15, and the value of $top comes from the fact that the GridView’s PageSize is 25 (then add 1 for the initial retrieval to force a Page 2 link in the GridView pager). 

26 entities are retrieved (presuming there are that many), and the HTTP response headers look something like:

      HTTP/1.1 200 OK
     Cache-Control: no-cache
     Content-Type: application/atom+xml;charset=utf-8
     Server: Windows-Azure-Table/1.0 Microsoft-HTTPAPI/2.0
     x-ms-request-id: 2e75d787-ca85-4044-b0cc-0d6e462d910c
     x-ms-version: 2009-09-19
     x-ms-continuation-NextPartitionKey: 1!16!SW5zdGFuY2VfMDI2
     x-ms-continuation-NextRowKey: 1!44!Q29tcGxldGVkX1VuaXR8MDI2fGRvd25sb2FkdGltZQ-- 
     Date: Thu, 02 Sep 2010 01:09:37 GMT
     Content-Length: 34290

Note the inclusion of two x-ms-continuation headers specifying how to pick up the next page of results.  Lines 32-39 record these values in the session state (via pageData variable).  If there are no continuation headers, it signifies the query results are complete, so the pageData continuation tokens are reset, and a boolean flag is flipped.
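As a small illustration of that header handling in isolation, the sketch below reads the two continuation headers out of a plain dictionary.  ContinuationToken is a hypothetical helper, not part of the StorageClient API, and it assumes the headers arrive as a string-to-string dictionary keyed exactly as in the sample response above.  One deliberate difference from the page code: it builds a fresh token for every response, so a row key from an earlier response is not carried forward when a later response supplies only a partition key.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: holds the two values carried by the
// x-ms-continuation-* response headers.
public class ContinuationToken
{
    public const string PartitionHeader = "x-ms-continuation-NextPartitionKey";
    public const string RowHeader = "x-ms-continuation-NextRowKey";

    public string PartitionKey { get; private set; }
    public string RowKey { get; private set; }

    // No partition key means the service reported no further results.
    public bool IsEnd
    {
        get { return PartitionKey == null; }
    }

    // Builds a fresh token from a single response's headers.
    public static ContinuationToken FromHeaders(IDictionary<string, string> headers)
    {
        var token = new ContinuationToken();
        string value;
        if (headers.TryGetValue(PartitionHeader, out value))
        {
            token.PartitionKey = value;
            if (headers.TryGetValue(RowHeader, out value))
            {
                token.RowKey = value;
            }
        }
        return token;
    }
}
```

A plain Dictionary compares keys case-sensitively, which is enough for a sketch; real header collections may need a case-insensitive lookup.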

Finally, Lines 50-53 take the result of the query, materialize it, and project it into a new collection of CompletedWorkUnits (thus pulling in the calculated Duration property needed for data binding to the GridView).

Executed next in the page lifecycle are control events.  There are two control events of interest: clicking the Refresh button and paginating through the GridView.  These events fire only on a postback, which means that the Page_Load processing merely sets up the cloud account and context variables; no retrieval of data occurs in Page_Load since that code is enclosed by an if (!this.IsPostBack) condition (Line 19).


Clicking the Refresh button is tantamount to retrieving the page the first time, so the code here essentially replicates what happens in Page_Load the first time through (with a couple of lines to reset the session state and GridView page).

 protected void btnRefresh_Click(object sender, EventArgs e)
 {
     // refresh all the data on the page when button is clicked
     pageData = new PageData();
     GridViewCompleted.PageIndex = 0;

     // ... first-load retrieval from Page_Load repeats here (elided in this excerpt) ...
 }


The PageIndexChanging handler is my favorite part of the code; it just seems amazingly simple and elegant!

    1:  protected void GridViewCompleted_PageIndexChanging(object sender, 
                      System.Web.UI.WebControls.GridViewPageEventArgs e)
    2:  {
    3:      // if on the last 'current' page, retrieve another chunk
    4:      if (e.NewPageIndex == GridViewCompleted.PageCount - 1)
    5:          RetrieveCompletedUnits();
    7:      // set requested page
    8:      GridViewCompleted.PageIndex = e.NewPageIndex;
    9:  }

Because of the way the page size was set up, as long as there are more entities to be retrieved, there will always be a next page link in the pager for the GridView.  The initial retrieval will imply there are two pages, and when the link for page 2 is selected, this event goes into action.

Line 4 checks to see if the user clicked the page link for the last page of the current GridView.  That could truly be the last of the data in the table, or it could just be the last page the user had requested thus far.  Which of these is the case is determined by RetrieveCompletedUnits (Line 5). 

If you revisit that code earlier in this post, you’ll note the first test is to see if the query results have been completely retrieved; if so, the method quickly returns.  If not, the pageData variable must be holding on to a continuation token value (PartitionKey and RowKey), and the next query built will include these values as query options (via AddQueryOption).  What gets issued via HTTP is something like:

     $filter=Progress%20eq%20100&$top=25&
     NextPartitionKey=1!16!SW5zdGFuY2VfMDI2&
     NextRowKey=1!44!Q29tcGxldGVkX1VuaXR8MDI2fGRvd25sb2FkdGltZQ--

If the requested page index isn’t the last one (the condition in Line 4 evaluates to false), the results are already in the session variable (pageData.CompletedList), and the regular data binding semantics and GridView mechanisms will take care of displaying the right data.


Speaking of data binding, that’s where the PreRenderComplete event comes in.  After all of the control events have fired and the appropriate data has been gathered in the pageData variable, binding the data to the Repeater and GridView controls (and updating a few literals for aesthetics) is a simple affair:

 protected void Page_PreRenderComplete()
 {
     // bind session data to the Repeater control
     InProgress.DataSource = pageData.InProgressList;
     InProgress.DataBind();

     // bind session data to the GridView control
     GridViewCompleted.DataSource = pageData.CompletedList;
     GridViewCompleted.DataBind();

     // update the labels
     litNoProgress.Visible = InProgress.Items.Count == 0;
     litCompletedTitle.Visible = GridViewCompleted.Rows.Count > 0;
 }

Why did… I choose to implement the data binding in the PreRenderComplete event instead of LoadComplete?  Well, originally I actually had put it in LoadComplete, and that will work fine here, but I’m setting us up for the next post where we’ll talk about asynchronously handling the paging, and there PreRenderComplete figures prominently.


 protected void Page_Unload()
 {
     Session["PageData"] = pageData;
 }

See, there is such a thing as self-documenting code!


Tidying Up Loose Ends

There are still a couple of gotchas to be aware of in the implementation above that robust production code would want to accommodate.

First, there’s little in the way of exception handling – but that’s par for the course with blog posts like these!  

The biggest loose end though is that my pagination logic in RetrieveCompletedUnits disregards a pernicious scenario.  In the fine print of the MSDN topic on Query Timeout and Pagination it’s mentioned that

It is possible for a query to return no results but to still return a continuation header.

What the heck does that mean?!   Steve Marx has dedicated a post to it, and while some of the references in the post are obsolete (e.g., ExecuteAll and ExecuteAllWithRetries are no longer part of the StorageClient API), the situation described can still occur, and my code above is far from accommodating!  I have not run into such a scenario during my testing, but my application usage is light, and it’s quite likely my data partitions are collocated at the same storage node.  (In Azure@home, the partitions are defined by the CurrentRoleInstance.Id of the Worker role recording the data).

How might I fix the code?  In Line 29 of RetrieveCompletedUnits the fatal flaw is the assumption that I’m getting a complete set of data back from the query.  That is, I’m getting maxRows or whatever’s left at the tail end of the result set.  What could happen, though, is that I get, say, six entities or even none, but also a continuation token.  If I really want to make sure I’m getting enough entities to fill a GridView page, I need to beef up my implementation, something like the following, in which the modifications to the code discussed earlier in the post are the do/while loop, the materialized returnedRows list, and the running maxRows adjustment.

 protected void RetrieveCompletedUnits()
 {
     if (!pageData.QueryResultsComplete)
     {
         // select enough rows to fill a page of the GridView
         // GridView.PageSize < 1000 or UI paradigm will fail
         Int32 maxRows = GridViewCompleted.PageSize;

         // add one if first page, to force a page 2 indicator
         if (pageData.PartitionKey == null)
             maxRows++;

         do
         {
             // set up query
             var qry = (from w in ctx.WorkUnits
                        where w.Progress == 100
                        select w).Take(maxRows) as DataServiceQuery<WorkUnit>;

             // add continuation token (if there is one) to query
             if (pageData.PartitionKey != null)
             {
                 qry = qry.AddQueryOption("NextPartitionKey", pageData.PartitionKey);
                 if (pageData.RowKey != null)
                 {
                     qry = qry.AddQueryOption("NextRowKey", pageData.RowKey);
                 }
             }

             // execute the query
             var response = qry.Execute() as QueryOperationResponse<WorkUnit>;

             // grab continuation token from response
             if (response.Headers.ContainsKey("x-ms-continuation-NextPartitionKey"))
             {
                 pageData.PartitionKey = response.Headers["x-ms-continuation-NextPartitionKey"];
                 if (response.Headers.ContainsKey("x-ms-continuation-NextRowKey"))
                 {
                     pageData.RowKey = response.Headers["x-ms-continuation-NextRowKey"];
                 }
             }
             // if no continuation token, reached end of table
             else
             {
                 pageData.PartitionKey = null;
                 pageData.RowKey = null;
                 pageData.QueryResultsComplete = true;
             }

             // add newly retrieved data to the current collection
             var returnedRows = response.ToList<WorkUnit>();
             pageData.CompletedList.AddRange(
                 from wu in returnedRows
                 select new CompletedWorkUnit(wu)
             );

             // still need rows to fill page size
             maxRows -= returnedRows.Count;

         } while ((!pageData.QueryResultsComplete) && (maxRows > 0));
     }
 }


Presuming the code above is correct… it wasn’t too hard to address the scenario that Steve described.  Instead of assuming I’ll get data back on each call to Execute, I built a loop around the retrieval logic so RetrieveCompletedUnits doesn’t return until it’s either got a full page of data or there’s no data left – regardless of how many actual REST requests are issued in the process.
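Since the scenario is hard to provoke against real storage, the loop can be exercised against a deliberately awkward fake.  This is only a sketch under stated assumptions: FlakyTableService and PageFiller are hypothetical names, the integer token stands in for the PartitionKey/RowKey pair, and the fake returns an empty response (with a token) on every other call and never more than three rows at a time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical service imitating "no results but still a continuation token".
public class FlakyTableService
{
    private readonly int rowCount;
    private bool returnEmptyNext = true;
    public int Calls { get; private set; }

    public FlakyTableService(int rowCount)
    {
        this.rowCount = rowCount;
    }

    // Honors 'top', caps each response at three rows, and alternates
    // between empty and non-empty responses.
    public List<int> Query(int top, int? token, out int? nextToken)
    {
        Calls++;
        int start = token ?? 0;
        int take = 0;
        if (!returnEmptyNext)
        {
            take = Math.Min(top, Math.Min(3, rowCount - start));
        }
        returnEmptyNext = !returnEmptyNext;
        int end = start + take;
        nextToken = end < rowCount ? end : (int?)null;
        return Enumerable.Range(start, take).ToList();
    }
}

public static class PageFiller
{
    // Keeps querying until 'wanted' rows have arrived or the service
    // reports the end; 'token' carries the position between calls.
    public static List<int> Fill(FlakyTableService service, int wanted,
                                 ref int? token, out bool complete)
    {
        var rows = new List<int>();
        int remaining = wanted;
        do
        {
            int? next;
            var chunk = service.Query(remaining, token, out next);
            rows.AddRange(chunk);
            remaining -= chunk.Count;   // still need this many to fill the page
            token = next;
            complete = next == null;
        } while (!complete && remaining > 0);
        return rows;
    }
}
```

The caller sees a full page (or the tail of the data) from each Fill call, however many round trips that took.  The trade-off is latency: a single page request may now fan out into several storage requests.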

Granted this post was a tad lengthy (when are mine not!), but hopefully it was straightforward enough for you to understand the pagination mechanism and how to code for it in your own applications.  The devil’s always in the details, but the primary takeaway here is

When using DataServiceQuery, always expect continuation tokens unless you’re specifying both PartitionKey and RowKey in the query (that is, you’re selecting a single entity).

In the next post, I’ll take this code and transform it using a later addition to the StorageClient API (CloudTableQuery) that will handle much of the continuation token rigmarole for you.