5 – Maximizing Availability, Scalability, and Elasticity

This chapter explores how you can maximize performance and availability for multi-tenant applications that run in Windows Azure. This includes considering how you can ensure that the application is scalable and responsive, and how you can take advantage of the elasticity available in Windows Azure to minimize running costs while meeting performance requirements.

Topics you will see discussed in this chapter include maximizing availability through geo-location, caching, and by using the Content Delivery Network (CDN); maximizing scalability through the use of Windows Azure storage queues, background processing, and asynchronous code; and implementing elasticity by controlling the number of web and worker role instances that are deployed and executing.

Maximizing Availability in Multi-Tenant Applications

Multiple tenants sharing an instance of a role or other resource in Windows Azure increases the risk that the application becomes unavailable for several tenants. For example, if a multi-tenant worker role becomes unavailable it affects all of the tenants sharing the role, whereas the failure of a single-tenant worker role only affects that one tenant. These risks can increase if tenants have the ability to apply extensive customizations to the application. You must ensure that any extensions to the application, added by either the provider or a tenant, will not introduce errors that could affect the availability of the role.

Jana says: One of the major advantages Windows Azure offers is the ability to use and pay for only what you actually need, while being able to increase and decrease the resources you use on demand without being forced to invest in spare or standby capacity.

In general, Windows Azure enables you to mitigate these risks by using multiple instances of resources. For example, you can run multiple instances of any web or worker role. Windows Azure will detect any failed role instances and route requests to other, functioning instances. Windows Azure will also attempt to restart failed instances.

However, running multiple instances of a role does impose some restrictions on your design. For example, because the Windows Azure load balancer can forward requests to any instance of a web role, either the role must be stateless or you must have a mechanism to share state across instances.

In the case of Windows Azure storage accounts, which contain blobs, tables, and queues, Windows Azure maintains multiple redundant copies. By default, Windows Azure uses geo-replication to make a copy of your data in another data center in addition to the multiple copies held in the data center that hosts your storage account.

Note

For more information about how Windows Azure protects your data, see the blog post Introducing Geo-replication for Windows Azure Storage.

It’s important, whether your application is multi-tenant or single-tenant, that you understand the impact of a failure of any element of your application. In particular, you must understand which failure conditions Windows Azure can handle automatically, and which failure conditions your application or your administrators must handle.

Hosting copies of your Windows Azure application in multiple datacenters is another way to help keep your application available. For example, you can use Windows Azure Traffic Manager to define failover policies in the event that a deployment of your application in a particular datacenter becomes unavailable for some reason. However, you must still carefully plan how you will store your data and determine the datacenter or datacenters where you will store each item of data.

Note

At the time of writing, Windows Azure Traffic Manager is a Community Technology Preview (CTP) release.

Maximizing Scalability in Multi-Tenant Applications

One of the reasons for running applications in Windows Azure is the scalability it offers. You can add resources to, or remove resources from, your application as and when they are required. As mentioned previously, Windows Azure applications typically comprise multiple elements such as web and worker roles, storage, queues, virtual networks, and caches. One of the advantages of dividing the application into multiple elements is that you can then scale each element individually. You might be able to meet a change in your tenants’ processing requirements by doubling the number of worker role instances without increasing the number of message queues or the size of your cache.

Bharath says: You should also consider the granularity of the scalable elements of your Windows Azure application. For example, if you start with a small instance of a worker role rather than a large instance, you will have much finer control over the quantity of resources your application uses (and finer control over costs) because you can add or remove resources in smaller increments.

In a multi-tenant application, you may decide to allocate groups of tenants to specific resources. For example, you could have one worker role that is dedicated to handling premium tenants, and another worker role that is dedicated to handling standard tenants. In this way, you could scale the worker role that supports premium tenants independently. This might be useful if you have a different SLA for premium tenants to that for standard tenants.

In addition to running multiple instances of a role, Windows Azure offers some features such as caching that are specifically designed to enhance the scalability of your application.

Caching

One of the most significant things you can do to enhance the scalability of your Windows Azure application is to use caching. Typically, you should try to cache frequently accessed data from blob storage, table storage, and databases such as SQL Database. Caching can reduce the latency in retrieving data, reduce the workload on your storage system, and reduce the number of storage transactions.

However, you must consider issues such as how much caching space you will need, your cache expiration policies, your strategies for ensuring that the cache is loaded with the correct data, and how much staleness you are willing to accept. Appendix E, “Maximizing Scalability, Availability, and Performance,” in the guide “Building Hybrid Applications in the Cloud on Windows Azure” explores caching strategies for a range of scenarios.

In a multi-tenant application, you also need to consider how to isolate tenant data within the cache. For more information about how to partition a cache in Windows Azure, see Chapter 4, “Partitioning Multi-Tenant Applications.”
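The caching considerations above (expiration, loading on a miss, and tenant isolation) can be illustrated with a minimal cache-aside sketch. It uses an in-memory dictionary as a stand-in for a cache service; the class name `TenantCache`, the `get_or_load` method, and the injected `clock` are all hypothetical and are not part of any Windows Azure API.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TenantCache:
    """In-memory stand-in for a cache service (hypothetical, not an Azure API)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # Keys are (tenant_id, item_key) pairs so tenants never share entries.
        self._items: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get_or_load(self, tenant_id: str, item_key: str,
                    loader: Callable[[], Any]) -> Any:
        key = (tenant_id, item_key)
        now = self._clock()
        entry = self._items.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no storage transaction needed
        value = loader()  # miss or expired entry: read from storage
        self._items[key] = (now + self._ttl, value)
        return value
```

The time-to-live value is the knob that trades staleness against the number of storage reads: a longer value means fewer reads but older data.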

Windows Azure offers two main caching mechanisms for application data: Windows Azure Shared Caching and Windows Azure Caching. For more information about the differences and similarities between these two approaches, see “Overview of Caching in Windows Azure” on MSDN.

Poe says: If you decide to use a co-located Windows Azure cache in one or more of your role instances, you must consider the impact of allocating this memory to the cache instead of to the application.

SQL Database Federation

You can use SQL Database Federation to scale out your SQL Database databases across multiple servers. SQL Database federations work by horizontally partitioning the data stored in your SQL Database tables across multiple databases. For more information, see “Federations in Windows Azure SQL Database.”

Bharath says: The type of horizontal partitioning used in SQL Database federations is often referred to as “sharding.”
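As a generic illustration of horizontal partitioning, the following sketch routes a numeric key to one of several databases by range. It does not use the SQL Database Federation API; the split points, the database names, and the `shard_for` function are assumptions made purely for the example.

```python
from bisect import bisect_right

# Hypothetical split points: keys below 1000 go to the first database,
# keys from 1000 to 4999 to the second, and keys of 5000 or more to the third.
SPLIT_POINTS = [1000, 5000]
SHARDS = ["surveys_db_0", "surveys_db_1", "surveys_db_2"]


def shard_for(tenant_key: int) -> str:
    """Return the name of the database that holds rows for this key."""
    return SHARDS[bisect_right(SPLIT_POINTS, tenant_key)]
```

Adding a split point (and a matching database name) divides one range into two, which is the basic move in scaling out a sharded data set.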

Shared Access Signatures

Shared Access Signatures (SAS) can help to make your application scalable by enabling clients to access items stored in blobs, tables, or queues directly and offloading the work of mediating access to these resources from your web and worker roles. For example, your application could use SAS to make the contents of a blob, such as an image, directly accessible from a web browser, without the need either to make the blob public, or the need to read private blob data in a web role and then pass it on to the client browser.

You could also use SAS to give a worker role hosted in another Windows Azure subscription access to specific rows in a table without either revealing your storage account keys or using a worker role in your subscription to retrieve the data on behalf of the worker role in the other subscription.

For more information about SAS, see “Creating a Shared Access Signature” on MSDN.
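The general idea behind a signed access token can be sketched as follows: the holder of a secret key signs the resource path, the permissions, and an expiry time, and the storage side later recomputes the signature before granting access. This is a conceptual illustration only. It does not reproduce the actual Windows Azure SAS string-to-sign or URL format, and the function names and token fields are hypothetical.

```python
import base64
import hashlib
import hmac


def _signature(key: bytes, resource: str, permissions: str, expires_at: int) -> str:
    # Sign every field that the client must not be able to change.
    string_to_sign = f"{resource}\n{permissions}\n{expires_at}".encode("utf-8")
    digest = hmac.new(key, string_to_sign, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def issue_token(key: bytes, resource: str, permissions: str, expires_at: int) -> dict:
    """Create a token that can be handed to a client instead of the key."""
    return {
        "resource": resource,
        "permissions": permissions,
        "expires_at": expires_at,
        "sig": _signature(key, resource, permissions, expires_at),
    }


def is_authorized(key: bytes, token: dict, needed_permission: str, now: int) -> bool:
    """Check expiry, permission, and signature without any per-client state."""
    if now >= token["expires_at"]:
        return False
    if needed_permission not in token["permissions"]:
        return False
    expected = _signature(key, token["resource"], token["permissions"],
                          token["expires_at"])
    return hmac.compare_digest(expected, token["sig"])
```

Because the check needs only the key and the token itself, no web or worker role has to sit between the client and the stored item.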

Content Delivery Network

The Content Delivery Network (CDN) can host static application resources such as media elements in edge caches. This reduces the latency for clients requesting these items, and it enhances the scalability of your application by offloading some of the work typically performed by web roles.

For more information about the CDN, see “Caching” on the Windows Azure features page.

Implementing Elasticity in Multi-Tenant Applications

Elasticity refers to the ability of the application to dynamically scale out or in based on actual or anticipated demand for resources. The discussion in the previous section about the scalability of web and worker roles in multi-tenant applications also applies to elasticity. In particular, you must decide at what level you want to enable elasticity: for individual tenants, for groups of tenants, or for all of the tenants in the application. You also need to identify which elements of the application, such as roles, storage, queues, and caches, must be elastic.

Elasticity is particularly important for multi-tenant applications because levels of demand may be less predictable than for single-tenant applications. For a single-tenant application, you can probably predict peak usage times during the day and then schedule resource-hungry batch processing for other times. In multi-tenant applications, especially those with users around the globe, usage patterns are less likely to be predictable. However, if you have a large number of tenants, variations in resource usage may average out.

Scaling Windows Azure Applications with Worker Roles

Because Windows Azure applications are typically made up of multiple elements such as web and worker roles, tables, blobs, queues, and caches, you must consider how to design the application so that each element can support multi-tenancy within the application as a whole, keeping it available and scalable.

You must also consider how best to achieve these goals within the web and worker roles that run your application code. It is possible, though not advisable, to create a large and complex multi-tenant application that has just a single web role (along with any storage that it requires in the cloud). However, you must then ensure that your single web role can handle multiple tenants and be scalable and available. This will almost certainly require complex code that uses multi-threading and asynchronous behavior.

Use worker roles to implement asynchronous background processing tasks in your Windows Azure application.

One of the key reasons for using multiple worker role types in your application is to simplify some aspects of the design of your application. For example, by using worker roles you can easily implement background processing tasks, and by using queues you can implement asynchronous behavior. Furthermore, by using multiple role types you can scale each one independently. You might have four instances of your web role, two instances of worker role A, two instances of worker role B, and eight queues. You could also scale roles vertically; for example, worker role A could be a small instance, and worker role B a large instance.

By using worker roles to handle storage interactions in your application, and queues to deliver storage insert, update, and delete requests to the worker role, you can implement load leveling. This is particularly important in the Windows Azure environment because both Windows Azure storage and SQL Database can throttle requests when the volume of requests gets too high.
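The effect of load leveling can be shown with a small simulation: requests arrive in bursts, wait in a queue, and a worker forwards at most a fixed number of writes to storage per time interval. The function name `level_load` and the idea of counting in discrete "ticks" are assumptions made for this sketch.

```python
from typing import List


def level_load(arrivals_per_tick: List[int], max_writes_per_tick: int) -> List[int]:
    """Return how many storage writes the worker performs in each tick."""
    if max_writes_per_tick <= 0:
        raise ValueError("max_writes_per_tick must be positive")
    backlog = 0  # messages waiting in the queue
    writes: List[int] = []
    for arrivals in arrivals_per_tick:
        backlog += arrivals
        done = min(backlog, max_writes_per_tick)
        backlog -= done
        writes.append(done)
    # Keep draining after the last arrivals until the queue is empty.
    while backlog > 0:
        done = min(backlog, max_writes_per_tick)
        backlog -= done
        writes.append(done)
    return writes
```

With a burst of ten requests and a cap of four writes per tick, storage sees a steady 4, 4, 2 instead of a spike of 10, at the cost of some requests being persisted a little later.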

Scalability is an issue for both single-tenant and multi-tenant architectures. Although it may be acceptable to allow certain operations at certain times to utilize most of the available resources in a single-tenant application (for example, calculating aggregate statistics over a large dataset at 2:00 A.M.), this is not an option for most multi-tenant applications where different tenants have different usage patterns.

Bharath says: The timing of maintenance tasks is typically more difficult to plan for multi-tenant applications. In a single-tenant application there may be windows of time to perform system maintenance without affecting users. This is much less likely to be the case in a multi-tenant application. Chapter 7, “Managing and Monitoring Multi-Tenant Applications,” discusses this issue in more detail.

You can use worker roles in Windows Azure to offload resource-hungry operations from the web roles that handle user interaction. These worker roles can perform tasks asynchronously when the web roles do not require the output from the worker role operations to be immediately available.

Example Scenarios for Worker Roles

The following table describes some example scenarios where you might partition the functionality of the application into separate worker roles for asynchronous job processing. Not all of these scenarios come from the Surveys application, but for each scenario the table specifies how to trigger the job and how many worker role instances it could use.

Scenario: Update survey statistics

  • Description: The survey owner wants to view the summary statistics of a survey, such as the total number of responses and average scores for a question. Calculating these statistics is a resource intensive task.
  • Solution: Every time a user submits a survey response the application puts a message in a queue named statistics-queue with a pointer to the survey response data. Every ten minutes a worker role retrieves the pending messages from the statistics-queue queue and adjusts the survey statistics to reflect those survey responses. Only one worker instance should do the calculation over a queue to avoid any concurrency issues when it updates the statistics table.
  • Triggered by: Time
  • Execution model: Single worker or multiple workers with concurrency control

Scenario: Dump survey data to Windows Azure SQL Database

  • Description: The survey owner wants to analyze the survey data using a relational database. Transferring large volumes of data is a time consuming operation.
  • Solution: The survey owner requests the application export the responses for a survey. This action creates a row in a table named exports and puts a message in a queue named export-queue pointing to that row. Any worker can dequeue messages from the export-queue queue and execute the export. After it finishes, it updates the row in the exports table with the status of the export process.
  • Triggered by: Message in queue
  • Execution model: Multiple workers

Scenario: Store a survey response

  • Description: Every time a respondent completes a survey, the response data must be reliably persisted to storage. The user should not have to wait while the application persists the survey data.
  • Solution: When a user submits a survey response the application writes the raw survey data to blob storage and puts a message in a queue named responses-queue. A worker role polls the responses-queue queue and, when a new message arrives, it stores the survey response data in table storage and puts a message in the statistics-queue queue to recalculate the statistics.
  • Triggered by: Message in queue
  • Execution model: Multiple workers

Scenario: Heartbeat

  • Description: Many workers running in a grid-like system have to send a “ping” at a fixed time interval to indicate to a controller that they are still active. The heartbeat message must be sent reliably without interrupting the worker’s main task.
  • Solution: Every minute each worker executes a piece of code that sends a “ping.”
  • Triggered by: Time
  • Execution model: Multiple workers

Note

You can scale the “Update survey statistics” scenario described in the preceding table by using one queue and one worker role instance for every tenant, or even for every survey. What’s important is that only one worker role instance should process and update data that is mutually exclusive within the dataset.

Looking at these example scenarios suggests you can categorize worker roles that perform background processing according to the criteria in the following table.

Trigger: Time

  • Execution: Single worker
  • Types of tasks: An operation on a set of data that changes frequently, and that requires an exclusive lock to avoid concurrency issues. Examples include aggregation, summarization, and denormalization. You may have multiple workers running, but you need some kind of concurrency control to avoid corrupting the data. Depending on the scenario you need to choose between optimistic and pessimistic locking by determining which approach enables the highest throughput.

Trigger: Time

  • Execution: Multiple workers
  • Types of tasks: An operation on a set of data that is mutually exclusive from other sets so that there are no concurrency issues. Independent operations that don’t work over data, such as a “ping.”

Trigger: Message in a queue

  • Execution: Single or multiple workers
  • Types of tasks: An operation on a small number of resources (for example, a blob or several table rows) that should start as soon as possible.

In the scenario where you use a single worker to update data that requires exclusive access, you may be able to use multiple workers if you can implement a locking mechanism to manage concurrent access. If you implement concurrency control with multiple workers to avoid corrupting shared data, you must choose between optimistic and pessimistic locking by determining which approach enables the highest throughput in your particular scenario.
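Optimistic concurrency is commonly implemented as a read, modify, conditional-write loop: the write succeeds only if the record's version is unchanged since it was read, and the worker retries on a conflict. The sketch below uses an in-memory store with a version counter. `StatsStore`, `VersionConflict`, and `add_responses` are hypothetical names, and the version counter merely plays the role that an entity version tag would play in a real storage service.

```python
from typing import Tuple


class VersionConflict(Exception):
    """Raised when the record changed after it was read."""


class StatsStore:
    """In-memory stand-in for a statistics record with a version number."""

    def __init__(self) -> None:
        self._value = 0
        self._version = 0

    def read(self) -> Tuple[int, int]:
        return self._value, self._version

    def write(self, value: int, expected_version: int) -> None:
        if expected_version != self._version:
            raise VersionConflict()
        self._value = value
        self._version += 1


def add_responses(store: StatsStore, count: int, max_attempts: int = 5) -> None:
    """Add to the response total, retrying if another worker got there first."""
    for _ in range(max_attempts):
        value, version = store.read()
        try:
            store.write(value + count, version)
            return
        except VersionConflict:
            continue  # re-read the latest value and try again
    raise RuntimeError("gave up after repeated conflicts")
```

This approach tends to suit workloads where conflicts are rare; when many workers contend for the same record, the retries can cost more than taking an exclusive lock up front.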

Triggers for Background Tasks

The trigger for a background task could be a timer or a signal in the form of a message in a queue. Time-based background tasks are appropriate when the task must process a large quantity of data that trickles in little by little. This approach is cheaper and will offer higher throughput than an approach that processes each piece of data as it becomes available because you can batch the operations and reduce the number of storage transactions required to process the data. You can implement a time-based trigger by using a Timer object in a worker role that executes a task at a fixed time interval.

Note

For flexibility in scheduling tasks you could use the Windows Task Scheduler within a worker role, or a specialized library such as Quartz.NET.
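The saving that comes from batching in a time-triggered task can be sketched as follows. Each time the timer fires, the task groups all pending messages by survey and performs one update per survey rather than one per message. The message shape (a dictionary with a `survey_id` field) and the `process_batch` function are assumptions for this example, and a plain dictionary stands in for the statistics table.

```python
from collections import Counter
from typing import Dict, List


def process_batch(pending_messages: List[dict], stats: Dict[str, int]) -> int:
    """Apply all pending messages and return the number of writes performed."""
    per_survey = Counter(message["survey_id"] for message in pending_messages)
    for survey_id, new_responses in per_survey.items():
        # One write per survey, however many responses arrived for it.
        stats[survey_id] = stats.get(survey_id, 0) + new_responses
    return len(per_survey)
```

Five messages that refer to two surveys result in two writes instead of five; the longer the interval between runs, the larger the potential saving and the staler the statistics.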

If the frequency at which new items of data become available is low and there is a requirement to process the new data as soon as possible, using a message in a queue as a trigger is the appropriate approach. You can implement a message-based trigger in a worker role by creating an infinite loop that polls a message queue for new messages. You can retrieve either a single message or multiple messages from the queue and execute a task to process the message or messages.

Markus says: You can pull multiple messages from a queue in a single transaction.
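A message-based trigger of this kind might look like the following polling loop, which retrieves messages in small batches and sleeps briefly when the queue is empty. A `deque` stands in for the queue service; the function name, the default batch size, and the `should_stop` and `sleep` parameters (included so that the loop can be stopped and tested) are all assumptions rather than part of any queue API.

```python
import time
from collections import deque
from typing import Callable, Deque


def poll_queue(queue: Deque[str],
               handle: Callable[[str], None],
               batch_size: int = 10,
               idle_seconds: float = 1.0,
               should_stop: Callable[[], bool] = lambda: False,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Process messages as they arrive until should_stop() returns True."""
    while not should_stop():
        # Take up to batch_size messages in one retrieval.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        if not batch:
            sleep(idle_seconds)  # nothing to do: wait before polling again
            continue
        for message in batch:
            handle(message)
```

In a worker role, `should_stop` would typically stay false for the lifetime of the role, which gives the infinite loop described above.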

Execution Model

In Windows Azure you typically execute background tasks by using worker roles. You could partition the application by having a separate worker role type for each type of background task in your application, but this approach means that you will need at least one separate worker role instance for each type of task. Often you can make better use of the available compute resources by having one worker role handle multiple types of tasks, especially when you have high volumes of data, because this approach reduces the risk of underutilizing your compute nodes. This approach, often referred to as role conflation, involves several trade-offs:

  • The first trade-off is the complexity and cost of implementing role conflation against the potential cost savings that result from reducing the number of running worker role instances.
  • The second trade-off is the cost savings of running fewer role instances against the flexibility of being able to scale the resources assigned to individual tasks.
  • The third trade-off is the time required to implement and test a solution that uses role conflation, and other business priorities such as time-to-market. In this scenario you can still scale out the application by starting up additional instances of the worker role.
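One way to picture role conflation is a single worker that dispatches each incoming message to a handler chosen by task type. The registry, the `task` decorator, the task-type names, and the handler bodies below are hypothetical placeholders for this sketch.

```python
from typing import Callable, Dict, List, Tuple

HANDLERS: Dict[str, Callable[[str], str]] = {}


def task(task_type: str):
    """Register a function as the handler for one type of background task."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        HANDLERS[task_type] = fn
        return fn
    return register


@task("store-response")
def store_response(payload: str) -> str:
    return f"stored {payload}"


@task("export")
def export_survey(payload: str) -> str:
    return f"exported {payload}"


def run_once(messages: List[Tuple[str, str]]) -> List[str]:
    """Handle a mixed list of (task_type, payload) messages in one worker."""
    results = []
    for task_type, payload in messages:
        handler = HANDLERS.get(task_type)
        if handler is None:
            results.append(f"skipped unknown task {task_type}")
            continue
        results.append(handler(payload))
    return results
```

The dispatch table is where the extra implementation complexity lives; the benefit is that one set of instances stays busy with whichever task type currently has work.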

Figure 1 shows the two scenarios for running tasks in worker roles.

Figure 1 - Handling multiple background task types


In the scenario where multiple instances of a worker role can all execute the same set of task types, you must distinguish between the task types where it is safe to execute the task in multiple worker roles simultaneously, and the task types where it is only safe to execute the task in a single worker role at a time.

To ensure that only one copy of a task can run at a time you must implement a locking mechanism. In Windows Azure you could use a message on a queue or a lease on a blob for this purpose. The diagram in Figure 2 shows that multiple copies of Tasks A and C can run simultaneously, but only one copy of Task B can run at any one time. One copy of Task B acquires a lease on a blob and runs; other copies of Task B will not run until they can acquire the lease on the blob.

Figure 2 - Multiple worker role instances

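The lease-based approach to running only one copy of a task can be sketched with an in-memory lease table. Each lease has an owner and an expiry time, so a crashed worker cannot hold it forever. `LeaseStore` and `run_exclusive` are hypothetical names; a real implementation would rely on the storage service to grant the lease atomically, which this single-process stand-in does not attempt.

```python
import time
from typing import Callable, Dict, Tuple


class LeaseStore:
    """Single-process stand-in for a lease on a blob (not thread-safe)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, Tuple[str, float]] = {}  # name -> (owner, expiry)

    def try_acquire(self, name: str, owner: str, seconds: float) -> bool:
        now = self._clock()
        holder = self._leases.get(name)
        if holder is not None and holder[1] > now and holder[0] != owner:
            return False  # someone else holds an unexpired lease
        self._leases[name] = (owner, now + seconds)
        return True

    def release(self, name: str, owner: str) -> None:
        holder = self._leases.get(name)
        if holder is not None and holder[0] == owner:
            del self._leases[name]


def run_exclusive(leases: LeaseStore, name: str, owner: str,
                  work: Callable[[], None], seconds: float = 60.0) -> bool:
    """Run work() only if this owner can take the lease; report whether it ran."""
    if not leases.try_acquire(name, owner, seconds):
        return False
    try:
        work()
    finally:
        leases.release(name, owner)
    return True
```

A worker that fails to obtain the lease simply skips the task and tries again on its next cycle, which matches the behavior described for Task B.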

The MapReduce Algorithm

For some Windows Azure applications, being limited to a single task instance for complex long-running calculations may have a significant impact on performance and may limit the scalability of the application. In these circumstances the MapReduce algorithm may provide a way to parallelize the calculations across multiple worker role instances.

Jana says: For the Surveys application, speed is not a critical factor in the calculation of the summary statistics. Tailspin is willing to tolerate a delay while this summary data is calculated, so it does not use MapReduce.

The original concepts behind MapReduce come from the map and reduce functions that are widely used in functional programming languages such as Haskell, F#, and Erlang. In the current context, MapReduce is a programming model that enables you to parallelize operations on a large dataset. In the case of the Surveys application, Tailspin considered using this approach to calculate the summary statistics by using multiple, parallel tasks instead of a single task. The benefit would be to speed up the calculation of the summary statistics by using multiple worker role instances.
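The shape of such a calculation can be shown in a few lines: a map step turns each partition of responses into a partial result, and a reduce step combines the partial results. The sample scores and function names below are invented for illustration. Here the map step runs sequentially, whereas in a real deployment each partition could be handled by a different worker role instance.

```python
from functools import reduce
from typing import List, Tuple


def map_partition(scores: List[int]) -> Tuple[int, int]:
    """Map step: summarize one partition as (response count, score total)."""
    return len(scores), sum(scores)


def combine(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    """Reduce step: merge two partial summaries."""
    return a[0] + b[0], a[1] + b[1]


# Hypothetical survey scores, split into three partitions.
partitions = [[4, 5], [3], [5, 5, 2]]
count, total = reduce(combine, map(map_partition, partitions), (0, 0))
mean = total / count
```

Carrying the count and the total separately, rather than averaging each partition, keeps the reduce step correct when partitions differ in size.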

Note

Hadoop on Windows Azure provides a framework that enables you to optimize the type of operations that benefit from the MapReduce programming model. For more information, see “Introduction to Hadoop on Windows Azure.”

Goals and Requirements

This section describes the availability, scalability, and elasticity goals and requirements that Tailspin has for the Surveys application.

Performance and Scalability when Saving Survey Response Data

When a user completes a survey, the application must save the user’s answers to the survey questions to storage so that the survey creator can access and analyze the results as required. The way that the application saves the survey response data must enable the Surveys application to meet the following three requirements:

  • The owner of the survey must be able to browse the results.
  • The application must be able to calculate summary statistics from the answers.
  • The owner of the survey must be able to export the answers in a format that enables detailed analysis of the results.

Tailspin expects to see a very large number of users completing surveys, and so the process that initially saves the data should be as efficient as possible. The application can handle any processing of the data after it has been saved by using an asynchronous worker process. For information about the design of this background processing functionality in the Surveys application, see the section “Partitioning Web and Worker Roles” in Chapter 4, “Partitioning Multi-Tenant Applications,” of this guide.

The focus in this chapter is on the way the Surveys application stores the survey answers. Whatever type of storage the Surveys application uses, it must be able to support the three requirements listed earlier while ensuring the application remains scalable. Storage costs are also a significant factor in the choice of storage type because survey answers account for the majority of the application’s storage requirements, both in terms of space used and the number of storage transactions required.

Jana says: Depending on the volume of survey responses received, transaction costs may become significant because calculating summary statistical data and exporting survey results will require the application to read survey responses from storage.

Summary Statistics

Tailspin anticipates that some surveys may have thousands, or even hundreds of thousands of respondents, and wants to make sure that the public website remains responsive for all users at all times. At the same time, survey owners want to be able to view summary statistics calculated from the survey responses submitted to date.

In addition to browsing survey responses, subscribers must be able to view some basic summary statistics that the application calculates for each survey, such as the total number of responses received, histograms of the multiple-choice results, and aggregations such as averages of the range results. The Surveys application provides a predetermined set of summary statistics that cannot be customized by subscribers. Subscribers who want to perform a more sophisticated analysis of their survey responses can export the survey data to a Windows Azure SQL Database instance.

Calculating summary statistics is an expensive operation if there are a large number of responses to process.

Given the expected volume of survey response data, Tailspin anticipates that generating the summary statistics will be an expensive operation because of the large number of storage transactions that must occur when the application reads the survey responses. Tailspin wants to have a different SLA for premium and standard subscribers. The Surveys application will prioritize updating the summary statistics for premium subscribers over updating the summary statistics for standard subscribers.
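One possible way to give premium subscribers priority is to keep two queues and always serve the premium queue first. The sketch below illustrates that idea only and is not a description of how Tailspin implemented it; the function name and the use of `deque` objects in place of storage queues are assumptions.

```python
from collections import deque
from typing import Deque, Optional


def next_statistics_job(premium: Deque[str], standard: Deque[str]) -> Optional[str]:
    """Return the next survey to process, preferring premium subscribers."""
    if premium:
        return premium.popleft()
    if standard:
        return standard.popleft()
    return None  # both queues are empty
```

A strict preference such as this can starve standard subscribers when premium traffic is heavy, so a variant that serves the standard queue at least every few jobs may be worth considering.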

The public site where respondents fill out surveys must always have fast response times when users save their responses, and it must record the responses accurately so that there is no risk of any errors in the data when a subscriber comes to analyze the results.

The developers at Tailspin also want to be able to run comprehensive unit tests on the components that calculate the summary statistics without any dependencies on Windows Azure storage.

Markus says: There are also integration tests that verify the end-to-end behavior of the application using Windows Azure storage.
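Keeping the statistics code free of storage dependencies usually means having it depend on a small interface that a test can satisfy with an in-memory fake. The following sketch illustrates that pattern; `ResponseStore`, `InMemoryResponseStore`, and `average_score` are hypothetical names and do not come from the Surveys code base.

```python
from typing import Dict, Iterable, List, Protocol


class ResponseStore(Protocol):
    """The only thing the statistics code needs from storage."""

    def get_scores(self, survey_id: str) -> Iterable[int]: ...


class InMemoryResponseStore:
    """Test double that holds responses in a dictionary."""

    def __init__(self, data: Dict[str, List[int]]) -> None:
        self._data = data

    def get_scores(self, survey_id: str) -> Iterable[int]:
        return self._data.get(survey_id, [])


def average_score(store: ResponseStore, survey_id: str) -> float:
    """Compute the mean score for a survey, or 0.0 if it has no responses."""
    scores = list(store.get_scores(survey_id))
    return sum(scores) / len(scores) if scores else 0.0
```

A unit test passes an `InMemoryResponseStore` with known data, while the deployed application would pass an implementation backed by real storage.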

Geo-location in the Surveys Application

Tailspin plans to offer subscriptions to the Surveys application to a range of users, from large enterprises to individuals. These subscribers could be based anywhere in the world, and may want to run surveys in other geographic locations. Each subscriber will select a geographic location during the on-boarding process; this location will be where the subscriber creates surveys, accesses the survey response data, and is also the default location for publishing surveys. Windows Azure allows you to select a geographic location for your Windows Azure services so that you can host your application close to your users.

The Surveys application is a “geo-aware” service.

Tailspin wants to allow subscribers to the Surveys service to override their default geographical location when they publish a survey. By default, a U.S.-based subscriber publishes surveys to a U.S.-based instance of the Surveys application, and a European subscriber would probably choose a Europe-based service. However, it’s possible that a subscriber might want to run a survey in a different geographic region than the one the subscriber is located in. Figure 3 shows how a U.S.-based subscriber might want to run a survey in Europe:

Figure 3 - A U.S. based subscriber running a survey in Europe


Poe says: You can check the current status of any Windows Azure datacenter on the “Windows Azure Service Dashboard” (https://www.microsoft.com/windowsazure/support/status/servicedashboard.aspx).

Of course, this doesn’t address the question of how users will access the appropriate datacenter. If a survey is hosted in only one datacenter, the subscriber would typically provide a link for users that specifies the survey in that datacenter; for example, http://eusurveys.tailspin.com/tenant1/europesurvey. A tenant could also use a CNAME in its DNS configuration to map an address such as http://eu.tenant1.com/surveys/tenant1/europesurvey to the actual URL of the survey installed in the North Europe datacenter at http://eusurveys.tailspin.com/tenant1/europesurvey.

However, if a subscriber decides to run an international survey and host it in more than one datacenter, Tailspin could allow it to configure a Windows Azure Traffic Manager policy that routes users’ requests to the appropriate datacenter—the one that will provide the best response times for their location.

For more information, see the section “Reducing Network Latency for Accessing Cloud Applications with Windows Azure Traffic Manager” in Appendix E of the guide “Building Hybrid Applications in the Cloud on Windows Azure.”

Making the Surveys Application Elastic

In addition to ensuring that Tailspin can scale out the Surveys application to meet higher levels of demand, Tailspin wants the application to be elastic and automatically scale out during anticipated and unexpected increases in demand for resources. The application should also automatically release resources when it no longer needs them in order to control its running costs.

Jana Says:
Tailspin expects that elasticity will be important for the public web site and the worker role. However, usage of the private subscriber web site will be much lower and Tailspin does not expect to have to scale this site automatically.

Scalability

In addition to partitioning the application into web and worker roles, queues, and storage, Tailspin plans to investigate any other features of Windows Azure that might enhance the scalability of the application. For example, it will evaluate whether the Surveys application will benefit from using the Content Delivery Network (CDN) to share media resources and offload some of the work performed by the web roles. It will also evaluate whether Shared Access Signatures (SAS) will reduce the workload of worker roles by making blob storage directly and securely available to clients.

Tailspin also wants to be able to test the application’s behavior when it encounters high levels of demand. Tailspin wants to verify that the application remains available to all its users, and that the automatic scaling that makes the application elastic performs effectively.

The scalability of the solution can be measured only by stress testing the application. Chapter 7, “Managing and Monitoring Multi-Tenant Applications,” outlines the approach that Tailspin took to stress test the Surveys application, and describes some of its findings.

Overview of the Solution

This section describes the approach taken by Tailspin to meet the goals and requirements that relate to making the application available, scalable, and elastic.

Options for Saving Survey Responses

As you saw in Chapter 3 of this guide, Tailspin chose to use Windows Azure blob storage to store survey responses submitted by users filling out surveys in the public survey website. This section describes in more detail how Tailspin made that decision, and the factors it considered.

In addition to the two options, writing directly to storage and using the delayed write pattern, that are discussed below, Tailspin also considered using shared access signatures to enable the client browser to save survey responses directly to blob storage and post a notification directly to a message queue. The benefit of this approach would be to offload the work of saving survey response data from the web role. However, they discounted this approach because of the complexity of implementing a reliable cross-browser solution and because of the loss of control in the web role over the process of saving survey responses.

Writing Directly to Storage

Figure 4 shows the process Tailspin implemented for saving the survey responses by writing them directly to blob storage using code running in the web role instances.

Figure 4 - Saving survey responses and generating statistics

Figure 4 also shows how the worker role instances collect each new set of responses from storage and use them to update the summary statistics for that survey. Not shown in this figure is the way that the web role informs the worker role that a new set of answers has been saved in blob storage. It does this by sending a message containing the identifier of the new set of survey answers to a notification queue that the worker role listens on.
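The direct-write flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Tailspin's implementation: the `InMemoryStorage` class, the blob naming scheme, and both function names are hypothetical, and an in-memory dictionary and deque stand in for blob storage and the notification queue. The counter makes the number of storage transactions per response visible.

```python
import json
import uuid
from collections import deque


class InMemoryStorage:
    """Hypothetical stand-in for blob storage plus one notification queue.

    Every call counts as one storage transaction.
    """

    def __init__(self):
        self.blobs = {}
        self.queue = deque()
        self.transactions = 0

    def put_blob(self, name, data):
        self.transactions += 1
        self.blobs[name] = data

    def get_blob(self, name, default=None):
        self.transactions += 1
        return self.blobs.get(name, default)

    def enqueue(self, message):
        self.transactions += 1
        self.queue.append(message)

    def dequeue(self):
        # Peek at the next message; it stays queued until complete() is called.
        self.transactions += 1
        return self.queue[0] if self.queue else None

    def complete(self, message):
        self.transactions += 1
        self.queue.remove(message)


def web_role_save(storage, survey, answers):
    """Web role: save the answers, then notify the worker role (two transactions)."""
    response_id = str(uuid.uuid4())
    storage.put_blob(f"responses/{survey}/{response_id}", json.dumps(answers))
    storage.enqueue(json.dumps({"survey": survey, "id": response_id}))
    return response_id


def worker_role_process_one(storage):
    """Worker role: read the notification and answers, update the summary (five transactions)."""
    message = storage.dequeue()
    if message is None:
        return False
    note = json.loads(message)
    answers = json.loads(storage.get_blob(f"responses/{note['survey']}/{note['id']}"))
    summary_name = f"summaries/{note['survey']}"
    summary = json.loads(storage.get_blob(summary_name, '{"count": 0}'))
    summary["count"] += 1  # a real summary would also fold in `answers`
    storage.put_blob(summary_name, json.dumps(summary))
    storage.complete(message)
    return True
```

In this sketch the user waits for both web role operations before the response page is returned, which is the delay labeled Tp in Figure 4.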

One of the concerns the developers had when choosing a storage mechanism was that saving a complete set of answers directly to Windows Azure storage from the web role could cause a delay (shown as Tp in Figure 4) at the crucial point when a user has just completed a survey. If a user has to wait while the answers are saved, he or she may decide to leave the site before the operation completes. To address this concern, the developers considered implementing the delayed write pattern.

Using the Delayed Write Pattern

The delayed write pattern is a mechanism that allows code to hand off tasks that may take some time to complete, without needing to wait for them to finish. The tasks can execute asynchronously as background processes, while the code that initiated them continues to perform other work or returns control to the user.

The delayed write pattern is particularly useful when the tasks that must be carried out can run as background processes, and you want to free the application’s UI for other tasks as quickly as possible. However, it does mean that you cannot return the result of the background process to the user within the current request. For example, if you use the delayed write pattern to queue an order placed by a user, you will not be able to include the order number generated by the background process in the page you send back.

In Windows Azure, background tasks are typically initiated by allowing the UI to hand off the task by sending a message to a Windows Azure storage queue. Because queues are the natural way to communicate between the roles in a Windows Azure application, it’s tempting to consider using them for an operation such as saving data collected in the UI. The UI code can write the data to a queue and then continue to serve other users without needing to wait for operations on the data to be completed.

Figure 5 shows the delayed write pattern that the Surveys application could use to save the results of a filled out survey to Windows Azure storage.

Figure 5 - Delayed write pattern for saving survey responses in the Surveys application

Based on tests that Tailspin performed, writing to a queue takes approximately the same time as writing to blob storage, and so there is no additional overhead for the web role compared to saving the data directly to blob storage when using the delayed write pattern.

In this scenario a user browses to a survey, fills it out, and then submits his or her answers to the Surveys website. The code running in the web role instance puts the survey answers into a message on a queue and returns a “Thank you” message to the user as quickly as possible, minimizing the value of Tp in Figure 5. One or more tasks in the worker role instances are then responsible for reading the survey response from the queue, saving it to Windows Azure storage, and updating the summary statistics. This operation must be idempotent to avoid any possibility of double counting and skewing the results.
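The scenario above, including the idempotency requirement, might look like the following sketch. It assumes that each response carries a unique identifier assigned before it is queued, and it uses Python's standard `queue.Queue` and plain dictionaries as stand-ins for the Windows Azure queue and storage; the function names are illustrative only.

```python
import queue


def web_role_submit(small_surveys, response_id, survey, answers):
    """Web role: a single queue write, then return the thank-you page at once."""
    small_surveys.put({"id": response_id, "survey": survey, "answers": answers})
    return "Thank you"


def worker_role_drain(small_surveys, saved, summaries):
    """Worker role: save each response and update the response count.

    A message may be delivered more than once, so a response whose id has
    already been saved is skipped. This is what keeps the count idempotent.
    """
    while True:
        try:
            message = small_surveys.get_nowait()
        except queue.Empty:
            return
        key = (message["survey"], message["id"])
        if key in saved:
            continue  # duplicate delivery: already saved and counted
        saved[key] = message["answers"]
        summaries[message["survey"]] = summaries.get(message["survey"], 0) + 1
```

In a real deployment the save and the statistics update are separate storage operations, so a worker that fails between them needs a way to detect which step already completed. The sketch does not model that case.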

Bharath Says:
Surveys is a “geo-aware” application. For example, a European company might want to run a survey in the U.S. but analyze the data locally in Europe; it could use a copy of the Surveys website and queues running in a datacenter in the U.S., and use worker roles and a storage account hosted in a datacenter in Europe. Moving data between data centers will incur bandwidth costs.

Handling Large Messages

There is a 64 kilobyte (KB) maximum size for a message on a Windows Azure queue, or 48 KB when using Base64 encoding for the message, so the approach shown in Figure 5 works only if the size of each survey response is less than the maximum. In most cases, except for very large surveys, it’s unlikely that the answers will exceed 48 KB but Tailspin must consider how it will handle this limitation.

Markus Says:
When you calculate the size of messages you must consider the effect of any encoding, such as Base64, you use to encode the data before you place it in a message.
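The 48 KB figure follows from the way Base64 works: every 3 bytes of input become 4 characters of output, so a 64 KB message can carry at most 64 × 3/4 = 48 KB of raw data. The sketch below shows the arithmetic and a simple size check. The function names are hypothetical, and serializing the answers as JSON is an assumption made for the example.

```python
import base64
import json

QUEUE_MESSAGE_LIMIT = 64 * 1024  # the 64 KB queue message limit described in the text


def encoded_size(raw_length):
    """Length in bytes of the padded Base64 encoding of raw_length bytes."""
    return 4 * ((raw_length + 2) // 3)


def fits_in_queue_message(answers):
    """True if the Base64-encoded JSON form of the answers fits in one message."""
    raw = json.dumps(answers).encode("utf-8")
    return len(base64.b64encode(raw)) <= QUEUE_MESSAGE_LIMIT
```

For example, `encoded_size(48 * 1024)` is exactly `64 * 1024`, while one additional byte of raw data pushes the encoded message over the limit.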

One option would be to implement a hard limit on the total response size by limiting the size of each answer, or by checking the total response size using JavaScript code running in the browser. However, Tailspin wants to avoid this as it may limit the attractiveness of its service to some subscribers.

Figure 6 shows how Tailspin could modify the delayed write pattern solution to handle survey results that are greater than 64 KB in size. It includes an optimization by saving messages that are larger than 64 KB to Windows Azure blob storage and placing a message on the “Big Surveys” queue to notify the worker role, which will read these messages from blob storage. Messages that are smaller than 64 KB are placed directly onto a queue as in the previous example.

Figure 6 - Handling survey results greater than 64 KB in size

The worker role now contains two tasks dedicated to saving survey responses and updating the summary statistics:

  • Task 1 polls the “Small Surveys” queue and picks up the sets of answers. Then (not shown in the figure) it writes them to storage and updates the summary statistics.
  • Task 2 polls the “Big Surveys” queue and picks up messages containing the identifier of the new answer sets that the web role has already written to storage. Then (not shown in the figure) it retrieves the answers from storage and uses them to update the summary statistics.

Notice that, for messages larger than the limit for the queue, the process is almost identical to that described in Figure 4 where Tailspin was not using the delayed write pattern.

Note

An alternative approach to overcoming the constraint imposed by the maximum message size in Windows Azure queues is to use Windows Azure Service Bus instead. Service Bus queues can handle messages up to 256 KB in size, or 192 KB after Base64 encoding. For more details see “Windows Azure Queues and Windows Azure Service Bus Queues - Compared and Contrasted.”

Another variation on the approach described here is to use a single queue that transports two different message types. One message type holds a full survey response as its payload; the other message type holds the address of the blob where the big survey response is stored. You can then implement a RetrieveSurvey method in your messaging subsystem that returns either a small or big survey response from the queue to your worker role. Your messaging subsystem now encapsulates all of the logic for handling different response sizes, hiding it from the rest of your application.
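A minimal sketch of this single-queue variation follows. The `SurveyMessaging` class, the message type names, and the 48 KB threshold are assumptions made for illustration; `retrieve_survey` plays the part of the RetrieveSurvey method mentioned above, and an in-memory deque and dictionary stand in for the queue and blob storage.

```python
import json
from collections import deque

# Raw payload limit before Base64 encoding, from the text. A real implementation
# would leave headroom for the message envelope (type and id).
MAX_INLINE_BYTES = 48 * 1024


class SurveyMessaging:
    """Hypothetical messaging subsystem that hides the two message types."""

    def __init__(self):
        self.queue = deque()
        self.blobs = {}

    def send_survey(self, response_id, answers):
        body = json.dumps(answers)
        if len(body.encode("utf-8")) <= MAX_INLINE_BYTES:
            # Small response: the message carries the full payload.
            self.queue.append({"type": "inline", "id": response_id, "body": body})
        else:
            # Big response: store the payload and queue only its address.
            self.blobs[response_id] = body
            self.queue.append({"type": "blob-ref", "id": response_id})

    def retrieve_survey(self):
        """Return (response_id, answers) for the next message, whatever its type."""
        if not self.queue:
            return None
        message = self.queue.popleft()
        if message["type"] == "inline":
            body = message["body"]
        else:
            body = self.blobs[message["id"]]
        return message["id"], json.loads(body)
```

The caller of `retrieve_survey` never sees which path a response took, which is the encapsulation benefit the paragraph describes.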

Scaling the Worker Role Tasks

In the initial solution Tailspin implemented, writing directly to storage from the web role, the worker role instances had only one task to accomplish: updating the summary statistics. When using the delayed write pattern the worker roles must accomplish two tasks: saving the answers to storage (where the answer set is smaller than the limit for a queue) and updating the summary statistics.

It’s possible that Tailspin will want, or need, to scale these two tasks separately. It’s vital that new answers are saved to storage as quickly as possible, whereas calculating the summary statistics may not be such an urgent requirement. The summary statistics can be recalculated from the answers should a failure occur, but the converse is not possible. Tailspin also wants to be able to differentiate the service level for premium and standard subscribers by ensuring that summaries for premium subscribers are available more quickly.

To scale the tasks separately Tailspin would need to use two separate worker roles:

  • A worker role that just updates the statistics by polling a queue for messages containing the identifier of new answer sets. In Figure 6 this is the “Big Surveys” queue that the web role uses to inform worker roles that it has saved directly to storage a new set of answers that is larger than the limit for a queue.
  • A worker role that just saves new answer sets to storage by polling a queue for messages that contain the answers. In Figure 6 this is the “Small Surveys” queue that the web role uses to post sets of answers that are smaller than the limit for a queue to worker roles. However, this worker role would then need to inform the worker role that updates the statistics that it has saved to storage a new set of answers. It would do this by sending a message containing the identifier of the new answer set to the “Big Surveys” queue shown in Figure 6.

To provide different levels of service, such as the speed of processing summary statistics, Tailspin could use separate queues for premium and standard subscribers and configure the worker role that saves the answers to storage to send the notification message to the appropriate queue. The worker role instances that poll these two queues could do so at different rates, or they could use an algorithm that gives precedence to premium subscribers.
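One possible form of an algorithm that gives precedence to premium subscribers is sketched below. It is an assumption, not a description of what Tailspin built: the worker prefers the premium queue but, after a fixed run of consecutive premium messages, lets one standard message through so that standard subscribers are never starved. The function name, the `state` dictionary, and the `max_premium_run` parameter are all hypothetical.

```python
from collections import deque


def next_notification(premium, standard, state, max_premium_run=3):
    """Pick the next message, preferring premium without starving standard.

    `state["run"]` holds the number of consecutive premium messages taken.
    """
    premium_turn = bool(premium) and (state["run"] < max_premium_run or not standard)
    if premium_turn:
        state["run"] += 1
        return "premium", premium.popleft()
    if standard:
        state["run"] = 0
        return "standard", standard.popleft()
    return None
```

Raising `max_premium_run` widens the gap in service level between the two tiers; polling the two queues at different rates, as the text suggests, is an alternative that needs no shared state.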

Comparing the Options

To identify the best solution for saving survey responses in the Surveys application, the developers at Tailspin considered several factors:

  • How to minimize the delay between a user submitting a set of answers and the website returning the “Thank you” page.
  • The opportunities for minimizing the storage transaction costs encountered with different approaches for saving the answers and calculating the summary statistics.
  • The impact on other parts of the system from the approach they choose for saving the answers.
  • The choice of persistent storage mechanism (blobs or tables) that best suits the approach they choose for saving and processing the answers, and will have the least impact on other parts of the system while still meeting all their requirements.

To help them understand the consequences of their choices, Tailspin’s developers created the following table to summarize the operations that must be executed for each of the three approaches they considered.

| Option | Answer set size | Web role storage transactions | Worker role storage transactions | Total number of transactions |
| --- | --- | --- | --- | --- |
| Write answers directly to storage from the web role. | Any | 1. Save answers to storage. 2. Post message to notification queue. | 1. Read message from notification queue. 2. Read answers from storage. 3. Read current summary statistics. 4. Write updated summary statistics. 5. Call complete on notification queue. | Seven |
| Use the delayed write pattern with the worker role handling the tasks of writing to storage and calculating summary statistics. | < 64 KB | 1. Post answers to “Small Surveys” queue. | 1. Read answers from “Small Surveys” queue. 2. Write answers to storage. 3. Read current summary statistics. 4. Write updated summary statistics. 5. Call complete on “Small Surveys” queue. | Six |
| Use the delayed write pattern with the worker role handling the tasks of writing to storage and calculating summary statistics. | > 64 KB | 1. Save answers to storage. 2. Post message to “Big Surveys” queue. | 1. Read message from “Big Surveys” queue. 2. Read answers from storage. 3. Read current summary statistics. 4. Write updated summary statistics. 5. Call complete on “Big Surveys” queue. | Seven |
| Use the delayed write pattern with separate worker roles for the tasks of writing to storage and calculating summary statistics. | < 64 KB | 1. Post answers to “Small Surveys” queue. | Save survey worker role: 1. Read answers from “Small Surveys” queue. 2. Write answers to storage. 3. Call complete on “Small Surveys” queue. 4. Post message to “Big Surveys” queue. Update statistics worker role: 1. Read message from “Big Surveys” queue. 2. Read answers from storage. 3. Read current summary statistics. 4. Write updated summary statistics. 5. Call complete on “Big Surveys” queue. | Ten |
| Use the delayed write pattern with separate worker roles for the tasks of writing to storage and calculating summary statistics. | > 64 KB | 1. Save answers to storage. 2. Post message to “Big Surveys” queue. | Update statistics worker role: 1. Read message from “Big Surveys” queue. 2. Read answers from storage. 3. Read current summary statistics. 4. Write updated summary statistics. 5. Call complete on “Big Surveys” queue. | Seven |

Some points to note about the contents of the table are:

  • Worker roles can read messages from a queue in batches, which reduces the storage transaction costs because reading a batch of messages counts as a single transaction. However, this means that there may be a delay between answers being submitted and the worker role processing them and, when using the delayed write pattern with small answer sets, saving them to storage.
  • Using the delayed write pattern with two separate worker role types allows you to scale the two tasks (writing to storage and calculating the summary statistics) separately. This means that the two tasks must access the answers separately and in the correct order. One task reads them from the answers queue, writes them to storage, and only then posts a message to a queue to indicate new answers are available. The second task reads the answers from storage when the message is received, and updates the statistics.
  • Using the delayed write pattern when messages are larger than the limit for the queue is not really the delayed write pattern at all. It is fundamentally the same as the original approach of saving the answers direct to storage.
  • Because the majority of answer sets are likely to be smaller than the limit for the queue, the third option that uses separate worker role types will typically use more storage transactions than if there was a predominance of large answer sets.
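To see how the mix of small and large answer sets affects each option, the totals from the table can be combined into an expected number of storage transactions per response. The sketch below is simple arithmetic over those totals; the option labels and the function name are made up for the example, and it assumes no batching of queue reads.

```python
# Storage transactions per response, taken from the table:
# (answer set smaller than the queue limit, answer set larger than the queue limit)
TRANSACTIONS = {
    "direct write": (7, 7),
    "delayed write, one worker role": (6, 7),
    "delayed write, two worker roles": (10, 7),
}


def expected_transactions(option, large_fraction):
    """Average transactions per response when `large_fraction` of answer sets are large."""
    small, large = TRANSACTIONS[option]
    return (1 - large_fraction) * small + large_fraction * large
```

If, say, 5% of answer sets exceed the queue limit, the averages are 7 for direct writes, 6.05 for the single worker role option, and 9.85 for the option with two worker roles.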

Keeping the UI Responsive when Saving Survey Responses

A key design goal is to minimize the time it takes to save a survey response and return control to the UI. Tailspin does not want survey respondents to leave the site while they wait for the application to save their survey responses. Irrespective of the way that the survey responses are saved to storage, the Surveys application will use a task in the worker role instances to calculate and save the summary statistics in the background after the responses are saved.

The initial approach that Tailspin implemented in the Surveys application requires the web role to perform two operations for each set of answers that users submit. It must first save them to storage and then, if that operation succeeds, post a message to the notification queue so that worker roles know there is a new survey response available.

When using the delayed write pattern and the total size of the answer set is smaller than the limit for Windows Azure storage queues, the web role instances need to perform only one operation. They just need to post the answers to a queue, and all of the processing will occur in the background. The worker roles will write the answers to storage and update the summary statistics; meanwhile the web role can return the “Thank you” page immediately.

If the total size of the answer set is larger than the limit for Windows Azure storage queues, the web role instances will need to perform two operations: saving the answers and then sending a message to the notification queue. However, it is expected that the vast majority of surveys will not produce answer sets that are larger than the limit for a queue.

Even if Tailspin wants to offer premium subscribers the capability for their summary statistics to be updated more quickly than those of standard subscribers, and does this by using two separate worker role types, the web role will still need to perform only one operation unless the answers set size is larger than the limit for a queue.

Therefore, the most efficient option from the point of view of minimizing UI delay will be to use the delayed write pattern because, in the vast majority of cases, it will require only a single operation within the web role code.

Minimizing the Number of Storage Transactions

Reading and writing survey responses account for the majority of storage transactions in the Tailspin Surveys application, and with high monthly volumes this can account for a significant proportion of Tailspin’s monthly running costs.

The option that requires the fewest storage transactions is the delayed write pattern with the worker role saving the answers and calculating the summary as one operation. This option will require an additional storage transaction for survey answers larger than the limit for a queue, but this is not expected to occur very often. However, as you saw in the previous section, this option limits the capability to scale the tasks separately in the worker role, and may make using separate queues for premium and standard subscribers more complicated.

The next best option is to write the answers directly to storage using code in the web role. Saving a complete survey response directly to blob storage requires a single storage transaction. If the Surveys application used Windows Azure table storage instead of blob storage, and could use a single entity group transaction to save a set of survey answers to table storage, it could also save each complete survey response in a single transaction.

Jana Says:
To be able to save a complete survey response in a single entity group transaction, the survey answer set must have no more than 100 answers, and must be stored in a single table partition. An entity group transaction batches a group of changes to a single table partition into a single, atomic operation that counts as a single storage transaction. An entity group transaction can update no more than 100 entities, and the total request size must not exceed 4 MB.

The third option, using the delayed write pattern with separate worker role types for saving the answers and updating the summary statistics, will require the highest number of storage transactions for the vast majority of survey answers.

The Impact on Other Parts of the System

The decision on the type of storage to use (blob or table) and whether to use the delayed write pattern can have an impact on other parts of the application, and on the associated systems and services. Tailspin’s developers carried out a set of spikes to determine whether using blob storage would make it difficult or inefficient to implement the other parts of the Surveys application that read the survey response data. This includes factors such as paging through survey responses in the UI, generating summary statistics, and exporting to a SQL Database instance.

They determined that using blob storage for storing survey response data will not introduce any significant additional complexity to the implementation, and will not result in a significant increase in the number of storage transactions within the system. Chapter 3, “Choosing a Multi-Tenant Data Architecture,” describes how Tailspin implemented both paging through survey responses stored in blob storage and exporting survey response data to SQL Database. The section “Options for Generating Summary Statistics” in this chapter describes how Tailspin implemented the export feature in the Surveys application.

The delayed write pattern has the advantage that it makes it easy to perform any additional processing on a survey response before it is saved, without delaying the UI. This processing might include formatting the data or adding contextual information. The web role places the raw survey response in a message. The worker role retrieves the message from the queue, performs any required processing on the data in the message, and then saves the processed survey response.

Pre-processing the data before the application saves it is typically used to avoid the need to perform the processing every time the data is read. If the application writes the data once, but reads it n times, the processing is performed only once, and not n times.

Tailspin did not identify any additional processing that the Surveys application could perform on the survey responses that would help to optimize the processes that read the survey data. The developers at Tailspin determined that they could implement all of these processes efficiently, whether the survey response data was stored in blob or table storage.

Choosing between Blob and Table Storage

The initial assumption of Tailspin’s developers during the early design process for the Surveys application was that it should save each survey response as a set of rows in Windows Azure table storage. However, before making the final decision, the developers carried out some tests to compare the speed of writing to table storage with the speed of writing to blob storage. They created some realistic spikes to compare how long it takes to serialize and save a survey response to a blob with how long it takes to save the same survey response as a set of entities to table storage in a single entity group transaction. They found that, in their particular scenario, saving to blob storage is significantly faster.

When using the delayed write pattern, the additional time to save the survey response data will affect only the worker role. The web role UI code will need only to write the survey responses to a queue. There will be no additional delay for users when submitting their answers. However, the added overhead in the worker role may require extra resources such as additional instances, which will increase the running cost of the application.

If Tailspin chose not to use the delayed write pattern, the increase in time for the web role to write to table storage would have an impact on the responsiveness of the UI. Using table storage would also have an impact when the delayed write pattern is used and the answer sets are predominantly larger than the limit for a queue. Therefore, to allow for this possibility and for future extensions to the application that may require larger messages to be accepted, Tailspin chose to store the answers in blob storage.

Markus Says:
If you use table storage you must consider how your choice of partition key affects the scalability of your solution both when writing and reading data. If we chose to store the survey answers in table storage we’d need to choose a partition key that allows the Surveys application to save each survey response using an entity group transaction, and read survey responses efficiently when it calculates summary statistics or exports data to SQL Database.

Options for Generating Summary Statistics

To meet the requirements for generating summary statistics, the developers at Tailspin decided to use a worker role to handle the task of generating these from the survey results. Using a worker role enables the application to perform this resource intensive process as a background task, ensuring that the web role responsible for collecting survey answers is not blocked while the application calculates the summary statistics.

Based on the framework for worker roles described in Chapter 4, “Partitioning Multi-Tenant Applications,” this asynchronous task is one that will be triggered on a schedule. In addition, because it updates a single set of results, it must run as a single instance process or include a way to manage concurrent access to each set of summary data.

To calculate the survey statistics, Tailspin considered two basic approaches. The first approach is for the task in the worker role to retrieve all the survey responses to date, recalculate the summary statistics, and then save the summary data over the top of the existing summary data. The second approach is for the task in the worker role to retrieve all the survey response data that the application has saved since the last time the task ran, and use this data to adjust the summary statistics to reflect the new survey results.

Note

You can use a queue to maintain a list of all new survey responses. The summarization task is triggered on a schedule that determines how often the task should look at the queue for new survey results to process.

The first approach is the simplest to implement, because the second approach requires a mechanism for tracking which survey results are new. The second approach also depends on it being possible to calculate the new summary data from the old summary data and the new survey results, without rereading all the original survey results.

For many types of summary statistic (such as total, average, count, and standard deviation) it is possible to calculate the new values based on the current values and the new results. For example, if you have already received five answers to a numeric question with an average of four, and you then receive a new response with an answer of 22, the new average is ((5 * 4) + 22) / 6, which equals seven. Note that you need to know both the current average and the current number of answers to calculate the new average. However, suppose you want one of your pieces of summary data to be a list of the ten most popular words used in answering a free-text question. In this case, you would always have to process all of the survey answers, unless you also maintained a separate list of all the words used and a count of how often they appeared. This adds to the complexity of the second approach.
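The following sketch illustrates merging for the statistics mentioned above. It is one possible formulation, not the code used in the Surveys application: the summary keeps a running count, sum, and sum of squares (the extra state needed for the average and standard deviation), and a word counter supports the most-popular-words list. All function and field names are hypothetical.

```python
import math
from collections import Counter


def merge_numeric(summary, new_answers):
    """Fold new numeric answers into the running count, sum, and sum of squares."""
    return {
        "count": summary["count"] + len(new_answers),
        "sum": summary["sum"] + sum(new_answers),
        "sum_sq": summary["sum_sq"] + sum(a * a for a in new_answers),
    }


def mean(summary):
    return summary["sum"] / summary["count"]


def std_dev(summary):
    """Population standard deviation derived from the running totals."""
    m = mean(summary)
    variance = summary["sum_sq"] / summary["count"] - m * m
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding errors


def merge_word_counts(counts, new_text_answers):
    """Add the words from new free-text answers to the stored word counts."""
    merged = Counter(counts)
    for text in new_text_answers:
        merged.update(text.lower().split())
    return merged
```

Starting from five answers of 4 (`{"count": 5, "sum": 20, "sum_sq": 80}`) and merging a single answer of 22 gives a mean of 7, matching the worked example. The sum-of-squares form is easy to merge but can lose precision for large values; a more numerically careful update may be preferable in production code.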

The key difference between the two approaches is in the number of storage transactions required to perform the summary calculations: this directly affects both the cost of each approach and the time it takes to perform the calculations. The graph in Figure 7 shows the result of an analysis that compares the number of transactions per month of the two approaches for three different daily volumes of survey answers. The graph shows the first approach on the upper line with the Recalculate label, and the second approach on the lower line with the Merge label.

Figure 7 - Comparison of transaction numbers for alternative approaches to calculating summary statistics

The graph clearly shows that fewer storage transactions are required if Tailspin adopts the merge approach. Tailspin decided to implement the merge approach in the Surveys application.

Note

The vertical cost scale on the chart is logarithmic. The analysis behind this chart makes a number of “worst case” assumptions about the way the application processes the survey results. The chart is intended to illustrate the relative difference in transaction numbers between the two approaches; it is not intended to show absolute numbers.

It is possible to optimize the recalculate approach if you decide to sample the survey answers instead of processing every single one when you calculate the summary data. You would need to perform some detailed statistical analysis to determine what proportion of results you need to select to calculate the summary statistics within an acceptable margin of error.

Scaling out the Generate Summary Statistics Task

The Tailspin Surveys application must be able to scale out to handle an increase in the number of survey respondents. This should include enabling multiple instances of the worker role that performs the summary statistics calculation and builds the ordered list of survey responses. For each survey there is just a single set of summary statistics, so the application must be able to handle concurrent access to a single blob from multiple worker roles without corrupting the data. Tailspin considered four options for handling concurrency:

  • Use a single instance of the worker role. While it is possible to scale up by using a larger instance, there is a limit to the scalability of this approach. Furthermore, this option does not include any redundancy if that instance fails.
  • Use the MapReduce programming model. This approach would enable Tailspin to use multiple task instances, but would add to the complexity of the solution.
  • Use pessimistic concurrency. In this approach, the statistics associated with several specific surveys are locked while a worker role processes a batch of new responses. The worker role reads a batch of messages from the queue, identifies the surveys they are associated with, locks those specific sets of summary statistics, calculates and saves the new summary statistics, and then releases the locks. This would mean that other worker instances trying to update any of the same sets of summary statistics are blocked until the first instance releases the locks.
  • Use optimistic concurrency. In this approach, when the worker role instance processes a batch of messages, it checks for each message whether or not another task is updating that specific survey’s summary statistics. If another task is already updating the statistics, the current task puts the message back on the queue to be reprocessed later; otherwise it goes ahead with the update.

Tailspin performed stress testing to determine the optimum solution and chose the fourth option—using optimistic concurrency. It allows Tailspin to scale out the worker role instance, allows for a higher throughput of messages than the pessimistic concurrency approach, and offers better performance because it does not require any locking mechanism. Although MapReduce would also work, it adds more complexity to the system than using the optimistic concurrency approach.

Note

For more information about the stress tests Tailspin carried out, see Chapter 7, “Managing and Monitoring Multi-tenant Applications.” For a description of the MapReduce programming model see the section “The MapReduce Algorithm” earlier in this chapter.

Using Windows Azure Caching

Chapter 4, “Partitioning Multi-Tenant Applications,” describes how Tailspin uses Windows Azure Caching to support the Windows Azure Caching session state provider, and how Tailspin ensures tenant data is isolated within the cache. Tailspin uses Windows Azure Caching to cache survey definitions and tenant data in order to reduce latency in the public Surveys website.

Tailspin chose to use Windows Azure Caching and configured a cache that is co-located in the Tailspin.Web role and uses 30% of the available memory. Tailspin will monitor cache utilization levels and the performance of the Tailspin.Web role in order to review whether these settings provide enough cache space without affecting the usability of the private tenant website.

The section “Caching Frequently Used Data” in Chapter 4 shows how caching is implemented in the data access layer. Tailspin Surveys implements caching behavior in the SurveyStore and TenantStore classes.

Using the Content Delivery Network

This section looks at how the Windows Azure Content Delivery Network (CDN) can improve the user experience. The CDN allows you to cache blob content at strategic locations around the world to make that content available with the maximum possible bandwidth to users, and minimize network latency. The CDN is designed to be used with blob content that is relatively static.

Bharath says:
The CDN enables you to have data that is stored in blobs cached at strategic locations around the world. You can also use the CDN as an endpoint for delivering streaming content from Windows Azure Media Services.

For the Surveys application, the developers at Tailspin have identified two scenarios where they could use the CDN:

  • Tailspin is planning to commission a set of training videos with titles such as “Getting Started with the Surveys Application,” “Designing Great Surveys,” and “Analyzing your Survey Results.”
  • Hosting the custom images and style sheets that subscribers upload.

In both of these scenarios, users will access the content many times before it’s updated. The training videos are likely to be refreshed only when the application undergoes a major upgrade, and Tailspin expects subscribers to upload corporate logos and style sheets that reflect corporate branding. Both of these scenarios will also account for a significant amount of bandwidth used by the application. Online videos will require sufficient bandwidth to ensure good playback quality, and every request to fill out a survey will result in a request for a custom image and style sheet.

One of the requirements for using the CDN is that the content must be in a blob container that you configure for public, anonymous access. Again, in both of the scenarios, the content is suitable for unrestricted access.

For information about the current pricing for the CDN, see the “Caching” section of the page “Pricing Details” on the Windows Azure website.

For data cached on the CDN, you are charged for outbound transfers based on the amount of bandwidth you use and the number of transactions. You are also charged at the standard Windows Azure rates for the transfers that move data from blob storage to the CDN. Therefore, it makes sense to use the CDN for relatively static content. With highly dynamic content you could, in effect, pay double because each request for data from the CDN triggers a request for the latest data from blob storage.

To use the CDN with the Surveys application, Tailspin will have to make a number of changes to the application. The following sections describe the solution that Tailspin plans to implement in the future; the current version of the Surveys application does not use the CDN.

Setting the Access Control for the BLOB Containers

Any blob data that you want to host on the CDN must be in a blob container with permissions set to allow full public read access. You can set this option when you create the container by calling the BeginCreate method of the CloudBlobContainer class or by calling the SetPermissions method on an existing container. The following code shows an example of how to set the permissions for a container.

protected void SetContainerPermissions(String containerName)
{
  CloudStorageAccount cloudStorageAccount =
    CloudStorageAccount.Parse(
      RoleEnvironment.GetConfigurationSettingValue(
      "DataConnectionString"));

  CloudBlobClient cloudBlobClient = 
    cloudStorageAccount.CreateCloudBlobClient();

  CloudBlobContainer cloudBlobContainer =
    new CloudBlobContainer(containerName, cloudBlobClient);

  BlobContainerPermissions blobContainerPermissions =
    new BlobContainerPermissions();

  blobContainerPermissions.PublicAccess = 
    BlobContainerPublicAccessType.Container;

  cloudBlobContainer.SetPermissions(
                     blobContainerPermissions);
}

Notice that the permission type used to set full public access is BlobContainerPublicAccessType.Container.

Configuring the CDN and Storing the Content

You configure the CDN at the level of a Windows Azure storage account through the Windows Azure Management Portal. After you enable CDN delivery for a storage account, any data in public blob containers is available for delivery by the CDN.

The application must place all the content to be hosted on the CDN into blobs in the appropriate containers. In the Surveys application, media files, custom images, and style sheets can all be stored in these blobs. For example, if a training video is packaged with a player application in the form of some HTML files and scripts, all of these related files can be stored as blobs in the same container.

Note

You must be careful if scripts or HTML files contain relative paths to other files in the same blob container because the path names will be case sensitive. This is because there is no real folder structure within a blob container, and any “folder names” are just a part of the file name in a single, flat namespace.

Configuring URLs to Access the Content

Windows Azure allocates URLs to access blob data based on the account name and the container name. For example, if Tailspin created a public container named “video” for hosting their training videos, you could access the “Getting Started with the Surveys Application” video directly in Windows Azure blob storage at http://tailspin.blob.core.windows.net/video/gettingstarted.html. This assumes that the gettingstarted.html page is a player for the media content.

The CDN provides access to hosted content using a URL in the form http://<uid>.vo.msecnd.net/, so the Surveys training video would be available on the CDN at http://<uid>.vo.msecnd.net/video/gettingstarted.html. Figure 8 illustrates this relationship between the CDN and blob storage.

Figure 8 - The Content Delivery Network

You can configure a CNAME entry in DNS to map a custom URL to the CDN URL. For example, Tailspin might create a CNAME entry to make http://files.tailspin.com/video/gettingstarted.html point to the video hosted on the CDN. You should verify that your DNS provider configures the DNS resolution to behave efficiently; the performance benefits of using the CDN could be offset if the name resolution of your custom URL involves multiple hops to a DNS authority in a different geographic region.
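
The relationship between the blob storage URL, the CDN URL, and a custom-domain URL amounts to swapping the host name while keeping the container and blob path. The following sketch shows one way an application might derive the CDN address from a blob address. The CdnUrlMapper class is hypothetical, and the host name az12345.vo.msecnd.net is an invented stand-in for the <uid> value that Windows Azure assigns when you enable the CDN.

```csharp
using System;

// Hypothetical helper; not part of the Tailspin Surveys code base.
public static class CdnUrlMapper
{
    // Returns a copy of the blob URI that points at a different host
    // (a CDN endpoint or a custom domain), keeping the scheme and the
    // container/blob path unchanged.
    public static Uri WithHost(Uri blobUri, string newHost)
    {
        if (blobUri == null)
        {
            throw new ArgumentNullException("blobUri");
        }

        if (string.IsNullOrEmpty(newHost))
        {
            throw new ArgumentException("A host name is required.", "newHost");
        }

        var builder = new UriBuilder(blobUri) { Host = newHost };
        return builder.Uri;
    }
}
```

For example, passing http://tailspin.blob.core.windows.net/video/gettingstarted.html and the host files.tailspin.com yields http://files.tailspin.com/video/gettingstarted.html. Keeping this mapping in one place would let Tailspin switch between blob storage, the CDN endpoint, and a custom domain through configuration.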

Note

For information about how to use a custom DNS name with your CDN content, see “How to Map CDN Content to a Custom Domain.”

When a user requests content from the CDN, Windows Azure automatically routes their request to the closest available CDN endpoint. If the blob data is found at that endpoint it’s returned to the user. If the blob data is not found at the endpoint it’s automatically retrieved from blob storage before being returned to the user and cached at the endpoint for future requests.

Jana says:
If the blob data is not found at the endpoint, you will incur Windows Azure storage transaction charges when the CDN retrieves the data from blob storage.

Setting the Caching Policy

All blobs cached by the CDN have a time-to-live (TTL) period that determines how long they will remain in the cache before the CDN goes back to blob storage to check for updated data. The default CDN caching policy uses an algorithm to calculate the TTL for cached content based on when the content was last modified in blob storage. The longer the content has remained unchanged in blob storage, the greater the TTL, up to a maximum of 72 hours.

Note

The CDN retrieves content from blob storage only if it is not in the endpoint’s cache, or if it has changed in blob storage.

You can also explicitly set the TTL by using the CacheControl property of the BlobProperties class. The following code example shows how to set the TTL to two hours.

blob.Properties.CacheControl = "max-age=7200";
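
Because the max-age value is expressed in seconds, it can be easier to read and maintain if the application builds the header value from a TimeSpan. The following is a minimal sketch of such a helper; the CdnCachePolicy class and its method name are hypothetical and are not part of the storage client library.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; not part of the Tailspin Surveys code base
// or the Windows Azure storage client library.
public static class CdnCachePolicy
{
    // Builds a Cache-Control value such as "max-age=7200" from a TTL.
    public static string ToCacheControl(TimeSpan timeToLive)
    {
        if (timeToLive < TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException("timeToLive");
        }

        // max-age takes a whole number of seconds; fractions are dropped.
        long seconds = (long)timeToLive.TotalSeconds;
        return "max-age=" + seconds.ToString(CultureInfo.InvariantCulture);
    }
}
```

With this helper, the two-hour TTL shown above could be written as blob.Properties.CacheControl = CdnCachePolicy.ToCacheControl(TimeSpan.FromHours(2)). Note that setting the property only changes the local object; the application still needs to save the blob's properties to storage for the new value to take effect.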

For more information about how to manage expiration policies with CDN, see “How to Manage Expiration of Blob Content.”

Hosting Tailspin Surveys in Multiple Locations

Hosting a survey in a web role in a different geographic location doesn’t, by itself, mean that people filling out the survey will see the best response times when they use the site. To render the survey, the application must retrieve the survey definition from storage, and the application must save the completed survey results to storage. If the application storage is in the U.S. datacenter, there is little benefit to European users accessing a website hosted in the European datacenter.

Figure 9 shows how Tailspin designed the application to handle this scenario and resolve the issue just described.

Figure 9 - Hosting a survey in a different geographic location

The following describes the steps illustrated in Figure 9:

  1. The subscriber designs the survey, and the application saves the definition in storage hosted in the U.S. datacenter.
  2. The Surveys application pushes the survey definition to another application instance in a European datacenter. This needs to happen only once.
  3. Survey respondents in Europe fill out the survey, and the application saves the data to storage hosted in the European datacenter.
  4. The application transfers the survey results data back to storage in the U.S. datacenter, where it is available to the subscriber for analysis.

Bharath says:
Tailspin must create separate cloud services and storage accounts for each region where it plans to allow subscribers to host surveys.

Tailspin could use caching to avoid the requirement to transfer the survey definitions between data centers in step 2. It could cache the survey definition in Europe and load the survey definition into the cache from the U.S. storage account. This approach means that the Surveys application hosted in Europe must be able to reach the storage account in the U.S. datacenter to be able to load survey definitions into the cache. Geo-replication of data in the U.S. datacenter provides resilience in the case that a major disaster affects the U.S. datacenter, but does not provide resilience in the case of connectivity problems between Europe and the U.S. For more information, see “Introducing Geo-replication for Windows Azure Storage.”

Synchronizing Survey Statistics

While the application data (the survey definitions and the answers submitted by users) is initially stored in the datacenter where the subscriber chose to host the survey, the application copies the data to the datacenter where the subscriber’s account is hosted. Figure 10 illustrates the roles and storage elements in a scenario where the subscriber is based in Europe and has chosen to host a survey in a U.S. datacenter.

Figure 10 - Saving the responses from a survey hosted in a different datacenter

This scenario is similar to the scenario described in the section “Writing Directly to Storage” earlier in this chapter, but there is now an additional worker role. This worker role is responsible for moving the survey response data from the datacenter where the subscriber chose to host the survey to the datacenter hosting the subscriber’s account. This way, the application transfers the survey data between datacenters only once instead of every time the application needs to read it; this minimizes the costs associated with this scenario.

Jana says:
The Surveys application reads survey response data when it calculates the statistics, when a subscriber browses through the responses, and when it exports the data to Windows Azure SQL Database.

In some scenarios, it may make sense to pre-process or summarize the data in the datacenter where it’s collected and transfer back only the summarized data to reduce bandwidth costs. For the Surveys application, Tailspin decided to move all the data back to the subscriber’s datacenter. This simplifies the implementation, helps to optimize the paging feature, ensures that each response is moved between datacenters only once, and ensures that the subscriber has access to all the survey data in the local data center.

The sample application does not currently implement this scenario.

When you deploy a Windows Azure application, you select the sub-region where you want to host the application. This sub-region effectively identifies the datacenter hosting your application. You can also define affinity groups to ensure that interdependent Windows Azure applications and storage accounts are grouped together. This improves performance because Windows Azure co-locates members of the affinity group in the same datacenter cluster, and reduces costs because data transfers within the same datacenter do not incur bandwidth charges. Affinity groups offer a small advantage over simply selecting the same sub-region for your hosted services because Windows Azure makes a “best effort” to optimize the location of those services.

Autoscaling and Tailspin Surveys

Tailspin plans to use the Autoscaling Application Block to make the Surveys application elastic. It will configure rules that set the minimum and maximum number of instances for each role type within Tailspin Surveys. For each role type, the minimum number of instances will be five, and Tailspin will adjust the maximum as more subscribers sign up for the service.

Tailspin will also configure dynamic scaling that is based on monitoring key metrics in the Tailspin Surveys application. Initially, it will create rules based on the CPU usage of the different roles and the length of the Windows Azure queues that pass messages to the worker role instances.

Tailspin does not plan to use scheduled rules that adjust the number of role instances based on time and date. However, it will analyze usage of the application to determine whether there are any usage patterns that it can identify in order to preemptively scale the application at certain times.
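
To make the interaction between the two kinds of rules concrete, the following sketch shows how a minimum/maximum constraint and a metric-based rule might combine to produce a target instance count. It is only an illustration of the logic: the Autoscaling Application Block expresses its rules in configuration rather than in code like this, and the ScalingSketch class and every threshold value below are hypothetical, not settings that Tailspin has chosen.

```csharp
using System;

// Hypothetical illustration; not how the Autoscaling Application Block
// is configured, and not part of the Tailspin Surveys code base.
public static class ScalingSketch
{
    // Invented thresholds, for illustration only.
    private const double HighCpuPercent = 80;
    private const double LowCpuPercent = 30;
    private const int LongQueue = 500;
    private const int ShortQueue = 50;

    public static int NextInstanceCount(
        int current, int min, int max,
        double averageCpuPercent, int queueLength)
    {
        int target = current;

        // Metric-based rule: add an instance under load, and remove one
        // only when both metrics indicate spare capacity.
        if (averageCpuPercent > HighCpuPercent || queueLength > LongQueue)
        {
            target = current + 1;
        }
        else if (averageCpuPercent < LowCpuPercent && queueLength < ShortQueue)
        {
            target = current - 1;
        }

        // Constraint rule: the result never leaves the [min, max] range.
        return Math.Max(min, Math.Min(max, target));
    }
}
```

The important design point is that the constraint always wins: even when the metrics suggest scaling down, the instance count never drops below the configured minimum, and it never exceeds the maximum that caps running costs.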

Note

For more information about how to add the Autoscaling Application Block to a Windows Azure application and how to configure your autoscaling rules, see the “Enterprise Library 5.0 Integration Pack for Windows Azure.”

Inside the Implementation

Now is a good time to walk through some of the code in the Tailspin Surveys application in more detail. As you go through this section you may want to download the Visual Studio solution for the Tailspin Surveys application from https://wag.codeplex.com/.

The Tailspin Surveys application uses a single worker role type to host two different asynchronous background tasks:

  • Calculating summary statistics.
  • Exporting survey response data to SQL Database.

The task that calculates the summary statistics also maintains the list of survey responses that enables subscribers to page through responses. Chapter 3, “Choosing a Multi-Tenant Data Architecture,” describes this part of the task. Chapter 3 also describes how the export to Windows Azure SQL Database works.

Saving the Survey Response Data Asynchronously

Before the task in the worker role can calculate the summary statistics, the application must save the survey response data to blob storage. The following code from the SurveysController class in the TailSpin.Web.Survey.Public project shows how the application saves the survey responses.

[HttpPost]
public ActionResult Display(string tenant, 
                            string surveySlug,
                            SurveyAnswer contentModel)
{
  var surveyAnswer = CallGetSurveyAndCreateSurveyAnswer(
    this.surveyStore, tenant, surveySlug);

  ...

  for (int i = 0; 
       i < surveyAnswer.QuestionAnswers.Count; i++)
  {
    surveyAnswer.QuestionAnswers[i].Answer = 
      contentModel.QuestionAnswers[i].Answer;
  }

  if (!this.ModelState.IsValid)
  {
    var model = 
      new TenantPageViewData<SurveyAnswer>(surveyAnswer);
    model.Title = surveyAnswer.Title;
    return this.View(model);
  }

  this.surveyAnswerStore.SaveSurveyAnswer(surveyAnswer);

  return this.RedirectToAction("ThankYou");
}

The surveyAnswerStore variable holds a reference to an instance of the SurveyAnswerStore type. The application uses the Unity Application Block (Unity) to initialize this instance with the correct IAzureBlob and IAzureQueue instances.

Note

Unity is a lightweight, extensible dependency injection container that supports interception, constructor injection, property injection, and method call injection. You can use Unity in a variety of ways to help decouple the components of your applications, to maximize coherence in components, and to simplify design, implementation, testing, and administration of these applications. For more information, and to download the application block, see “Unity Application Block.”

The blob container stores the answers to the survey questions, and the queue maintains a list of new survey answers that haven’t yet been included in the summary statistics or the list of survey answers.

The SaveSurveyAnswer method writes the survey response data to the blob storage and puts a message onto a queue. The action method then immediately returns a “Thank you” message. The following code example shows the SaveSurveyAnswer method in the SurveyAnswerStore class.

public void SaveSurveyAnswer(SurveyAnswer surveyAnswer)
{
  var tenant = this.tenantStore
    .GetTenant(surveyAnswer.Tenant);
  if (tenant != null)
  {
    var surveyAnswerBlobContainer = this
      .surveyAnswerContainerFactory
      .Create(surveyAnswer.Tenant, surveyAnswer.SlugName);

    surveyAnswer.CreatedOn = DateTime.UtcNow;
    var blobId = Guid.NewGuid().ToString();
    surveyAnswerBlobContainer.Save(blobId, surveyAnswer);

    (SubscriptionKind.Premium.Equals(
       tenant.SubscriptionKind)
       ? this.premiumSurveyAnswerStoredQueue
       : this.standardSurveyAnswerStoredQueue)
       .AddMessage(new SurveyAnswerStoredMessage
       {
         SurveyAnswerBlobId = blobId,
         Tenant = surveyAnswer.Tenant,
         SurveySlugName = surveyAnswer.SlugName
       });
  }
}

Poe says:
Make sure that the storage connection strings in your deployment point to storage in the deployment’s datacenter. The application should use local queues and blob storage to minimize latency. Also ensure that you call the CreateIfNotExist method of a queue or blob only once in your storage class constructor, and not in every call to store data. Repeated calls to the CreateIfNotExist method will hurt performance.

This method first locates the blob container for the survey responses. It then creates a unique blob ID by using a GUID, and saves the blob to the survey container. Finally, it adds a message to a queue. The application uses two queues, one for premium subscribers and one for standard subscribers, to track new survey responses that must be included in the summary statistics and the list of responses for paging through answers.

Markus says:
It’s possible that the role could fail after it adds the survey data to blob storage but before it adds the message to the queue. In this case, the response data would not be included in the summary statistics or the list of responses used for paging. However, the response would be included if the user exported the survey to Windows Azure SQL Database. Tailspin has decided that this is an acceptable risk in the Surveys application.

Calculating the Summary Statistics

The team at Tailspin decided to implement the asynchronous background task that calculates the summary statistics from the survey results by using a merge approach. Each time the task runs it processes the survey responses that the application has received since the last time the task ran. It calculates the new summary statistics by merging the new results with the old statistics.
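
The essence of the merge approach can be sketched with a simple per-question tally: the task builds a partial summary from the new responses only, and then adds the stored totals to it before saving. The QuestionSummary class below is a hypothetical, much-simplified stand-in for the SurveyAnswersSummary type used by the Surveys application, which tracks more than answer counts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for a survey summary; it only counts
// how many times each answer was given to a single question.
public sealed class QuestionSummary
{
    private readonly Dictionary<string, int> countsByAnswer =
        new Dictionary<string, int>();

    public int TotalResponses { get; private set; }

    // Adds one new response to this (partial) summary.
    public void AddAnswer(string answer)
    {
        int count;
        this.countsByAnswer.TryGetValue(answer, out count);
        this.countsByAnswer[answer] = count + 1;
        this.TotalResponses++;
    }

    // Folds the previously stored summary into this partial summary,
    // so that only the new responses ever need to be read again.
    public void MergeWith(QuestionSummary stored)
    {
        if (stored == null)
        {
            return; // Nothing stored yet: the partial summary is the result.
        }

        foreach (var pair in stored.countsByAnswer)
        {
            int count;
            this.countsByAnswer.TryGetValue(pair.Key, out count);
            this.countsByAnswer[pair.Key] = count + pair.Value;
        }

        this.TotalResponses += stored.TotalResponses;
    }

    public int CountFor(string answer)
    {
        int count;
        this.countsByAnswer.TryGetValue(answer, out count);
        return count;
    }
}
```

Because the cost of each run depends on the number of new responses rather than on the total number of responses, this approach needs far fewer storage reads than recalculating the statistics from scratch.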

Worker role instances, defined in the TailSpin.Workers.Surveys project, periodically scan two queues for pending survey answers to process. One queue contains a list of unprocessed responses to premium subscribers’ surveys; the other queue contains the list of unprocessed responses to standard subscribers’ surveys.

The worker role instances executing this task use an optimistic concurrency approach when they try to save the new summary statistics. If one instance detects that another instance updated the statistics for a particular survey while it was processing a batch of messages, it does not perform the update for this survey and puts the messages associated with it back onto the queue for processing again.

The following code example from the UpdatingSurveyResultsSummaryCommand class shows how the worker role processes each pending survey answer and then uses the answers to update the summary statistics.

public class UpdatingSurveyResultsSummaryCommand :
                IBatchCommand<SurveyAnswerStoredMessage>
{
  private readonly
    IDictionary<string, TenantSurveyProcessingInfo>
    tenantSurveyProcessingInfoCache;
  private readonly ISurveyAnswerStore surveyAnswerStore;
  private readonly
    ISurveyAnswersSummaryStore surveyAnswersSummaryStore;

  public UpdatingSurveyResultsSummaryCommand(
    IDictionary<string, TenantSurveyProcessingInfo>
      processingInfoCache,
    ISurveyAnswerStore surveyAnswerStore,
    ISurveyAnswersSummaryStore surveyAnswersSummaryStore)
  {
    this.tenantSurveyProcessingInfoCache =
                  processingInfoCache;
    this.surveyAnswerStore = surveyAnswerStore;
    this.surveyAnswersSummaryStore =
                  surveyAnswersSummaryStore;
  }

  public void PreRun()
  {
    this.tenantSurveyProcessingInfoCache.Clear();
  }

  public bool Run(SurveyAnswerStoredMessage message)
  {
    if (!message.AppendedToAnswers)
    {
      this.surveyAnswerStore
        .AppendSurveyAnswerIdToAnswersList(
          message.Tenant,
          message.SurveySlugName,
          message.SurveyAnswerBlobId);
      message.AppendedToAnswers = true;
      message.UpdateQueueMessage();
    }

    var surveyAnswer = this.surveyAnswerStore
      .GetSurveyAnswer(
        message.Tenant,
        message.SurveySlugName,
        message.SurveyAnswerBlobId);

    var keyInCache = string.Format(
      CultureInfo.InvariantCulture, "{0}-{1}",
      message.Tenant, message.SurveySlugName);
    TenantSurveyProcessingInfo surveyInfo;

    if (!this.tenantSurveyProcessingInfoCache
        .ContainsKey(keyInCache))
    {
      surveyInfo = new TenantSurveyProcessingInfo(
                message.Tenant, message.SurveySlugName);
      this.tenantSurveyProcessingInfoCache[keyInCache] =
                surveyInfo;
    }
    else
    {
      surveyInfo =
        this.tenantSurveyProcessingInfoCache[keyInCache];
    }

    surveyInfo.AnswersSummary.AddNewAnswer(surveyAnswer);
    surveyInfo.AnswersMessages.Add(message);

    return false; // Don't remove the message from the queue
  }

  public void PostRun()
  {
    foreach (var surveyInfo in
               this.tenantSurveyProcessingInfoCache.Values)
    {
      try
      {
        this.surveyAnswersSummaryStore
          .MergeSurveyAnswersSummary(
            surveyInfo.AnswersSummary);

        foreach (var message in surveyInfo.AnswersMessages)
        {
          try
          {
            message.DeleteQueueMessage();
          }
          catch (Exception e)
          {
            TraceHelper.TraceWarning(
              "Error deleting message for '{0}-{1}': {2}",
              message.Tenant, message.SurveySlugName,
              e.Message);
          }
        }
      }
      catch (Exception e)
      {
        // Do nothing. This leaves the messages in 
        // the queue ready for processing next time.
        TraceHelper.TraceWarning(e.Message);
      }
    }
  }
}

The Surveys application uses Unity to initialize an instance of the UpdatingSurveyResultsSummaryCommand class, and the surveyAnswerStore and surveyAnswersSummaryStore variables. The surveyAnswerStore variable is an instance of the SurveyAnswerStore type that the Run method uses to read the survey responses from blob storage.

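The surveyAnswersSummaryStore variable is an instance of the SurveyAnswersSummaryStore type that the PostRun method uses to write summary data to blob storage. The tenantSurveyProcessingInfoCache dictionary holds a TenantSurveyProcessingInfo object for each survey in the current batch; each of these objects contains the SurveyAnswersSummary object for that survey and the list of messages that contributed to it.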

The PreRun method runs before the task reads any messages from the queue and initializes a temporary cache for the new survey response data.

The Run method runs once for each new survey response. It uses the message from the queue to locate the new survey response, and adds the survey response to the SurveyAnswersSummary object for the appropriate survey by calling the AddNewAnswer method. The AddNewAnswer method updates the summary statistics held in that in-memory SurveyAnswersSummary object. The Run method also calls the AppendSurveyAnswerIdToAnswersList method to update the list of survey responses that the application uses for paging. The Run method leaves all the messages in the queue in case the task encounters an optimistic concurrency violation when it tries to save the results in the PostRun method.

Markus says:
The Run method calls the UpdateQueueMessage method on the message after it has updated the list of stored survey responses to prevent a timeout from occurring that could cause the message to be reprocessed. For more information, see “CloudQueue.UpdateMessage Method.”

The PostRun method runs after the task has invoked the Run method on each outstanding survey response message in the current batch. For each survey, it merges the new results with the existing summary statistics and then saves the new values back to blob storage. The EntitiesBlobContainer detects any optimistic concurrency violations when it tries to save the new summary statistics and raises an exception. The PostRun method catches these exceptions and leaves the messages associated with that survey’s statistics on the queue so that they will be processed in another batch.

The worker role uses some “plumbing” code developed by Tailspin to invoke the PreRun, Run, and PostRun methods in the UpdatingSurveyResultsSummaryCommand class on a schedule. Chapter 4, “Partitioning Multi-Tenant Applications,” describes this plumbing code in detail as part of the explanation about how Tailspin partitions the work in a worker role by using different tasks. The following code example shows how the Surveys application uses the “plumbing” code in the Run method in the worker role to run the three methods that comprise the job.

var standardQueue = this.container.Resolve
  <IAzureQueue<SurveyAnswerStoredMessage>>
  (SubscriptionKind.Standard.ToString());
var premiumQueue = this.container.Resolve
  <IAzureQueue<SurveyAnswerStoredMessage>>
  (SubscriptionKind.Premium.ToString());

BatchMultipleQueueHandler
  .For(premiumQueue, GetPremiumQueueBatchSize())
  .AndFor(standardQueue, GetStandardQueueBatchSize())
  .Every(TimeSpan.FromSeconds(
    GetSummaryUpdatePollingInterval()))
  .WithLessThanTheseBatchIterationsPerCycle(
    GetMaxBatchIterationsPerCycle())
  .Do(this.container
    .Resolve<UpdatingSurveyResultsSummaryCommand>());

This method first uses Unity to instantiate the UpdatingSurveyResultsSummaryCommand object that defines the job and the two AzureQueue objects (one for premium subscribers and one for standard subscribers) that hold notifications of new survey responses.

The method then passes these objects as parameters to the For, AndFor, and Do plumbing methods of the worker role framework. The Every method specifies how frequently the job should run. These methods cause the plumbing code to invoke the PreRun, Run, and PostRun methods in the UpdatingSurveyResultsSummaryCommand class, passing a message from the queue to the Run method.

You should tune the frequency at which these tasks run based on your expected workloads by changing the value passed to the Every method.

Pessimistic and Optimistic Concurrency Handling

Tailspin uses optimistic concurrency when it saves summary statistics and survey answer lists to blob storage. The Surveys application enables developers to choose either optimistic or pessimistic concurrency when saving blobs. The following code sample from the SurveyAnswersSummaryStore class shows how the Surveys application uses optimistic concurrency when it saves a survey’s summary statistics to blob storage.

OptimisticConcurrencyContext context;

var id = string.Format(CultureInfo.InvariantCulture,
  "{0}-{1}", partialSurveyAnswersSummary.Tenant,
  partialSurveyAnswersSummary.SlugName);

var surveyAnswersSummaryInStore = this
  .surveyAnswersSummaryBlobContainer.Get(id, out context);

partialSurveyAnswersSummary
  .MergeWith(surveyAnswersSummaryInStore);

this.surveyAnswersSummaryBlobContainer
  .Save(context, partialSurveyAnswersSummary);

In this example the application uses the Get method to retrieve content to update from a blob. It then makes the change to the content and calls the Save method to try to save the new content. It passes in the OptimisticConcurrencyContext object that it received from the Get method. If the Save method encounters a concurrency violation, it throws an exception and does not save the new blob content.
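
The same read, modify, and conditional save sequence can be demonstrated without Windows Azure storage. The following sketch uses an in-memory dictionary as a stand-in for a blob container and an incrementing version number as a stand-in for the ETag. The InMemoryBlobStore class is hypothetical; it exists only to show why the second of two overlapping writers fails.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical in-memory stand-in for a blob container. A version number
// plays the role of the ETag that blob storage returns with each blob.
public sealed class InMemoryBlobStore
{
    private sealed class Entry
    {
        public string Content;
        public int Version;
    }

    private readonly Dictionary<string, Entry> blobs =
        new Dictionary<string, Entry>();
    private readonly object sync = new object();

    // Returns the content and the current ETag (null if the blob is missing).
    public string Get(string id, out string etag)
    {
        lock (this.sync)
        {
            Entry entry;
            if (!this.blobs.TryGetValue(id, out entry))
            {
                etag = null;
                return null;
            }

            etag = entry.Version.ToString(CultureInfo.InvariantCulture);
            return entry.Content;
        }
    }

    // Saves only if the blob is unchanged since the caller read it;
    // otherwise throws to signal a concurrency violation.
    public void Save(string id, string content, string expectedEtag)
    {
        lock (this.sync)
        {
            Entry entry;
            bool exists = this.blobs.TryGetValue(id, out entry);
            string currentEtag = exists
                ? entry.Version.ToString(CultureInfo.InvariantCulture)
                : null;

            if (currentEtag != expectedEtag)
            {
                throw new InvalidOperationException(
                    "Concurrency violation: the blob changed after it was read.");
            }

            this.blobs[id] = new Entry
            {
                Content = content,
                Version = exists ? entry.Version + 1 : 1
            };
        }
    }
}
```

If two workers both read version 1 of a summary, the first call to Save succeeds and moves the blob to version 2. The second call still presents version 1, so it throws instead of overwriting the first worker's update; in the Surveys application, this is the point at which the messages are left on the queue for a later batch.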

The following code samples from the EntitiesBlobContainer class show how it creates a new OptimisticConcurrencyContext object in the DoGet method using an ETag object, and then uses the OptimisticConcurrencyContext object in the DoSave method to create a BlobRequestOptions object that contains the ETag and an access condition. The content of the BlobRequestOptions object enables the UploadText method to detect a concurrency violation; the method can then throw an exception to notify the caller of the concurrency violation.

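protected override T DoGet(string objId,
          out OptimisticConcurrencyContext context)
{
  CloudBlob blob = this.Container.GetBlobReference(objId);
  // Load the blob's properties so that the current ETag is available.
  blob.FetchAttributes();
  // Record the ETag so that a later save can detect an intervening change.
  context = new OptimisticConcurrencyContext
               (blob.Properties.ETag) { ObjectId = objId };
  // Download the JSON content and deserialize it.
  return new JavaScriptSerializer()
    .Deserialize<T>(blob.DownloadText());
}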

protected override void DoSave(
                IConcurrencyControlContext context, T obj)
{
  ...

  if (context is OptimisticConcurrencyContext)
  {
    CloudBlob blob =
      this.Container.GetBlobReference(context.ObjectId);
    blob.Properties.ContentType = "application/json";

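    // Make the upload conditional on the access condition
    // held by the optimistic concurrency context.
    var blobRequestOptions = new BlobRequestOptions()
    {
      AccessCondition =
        (context as OptimisticConcurrencyContext)
        .AccessCondition
    };

    // The upload fails if the access condition is not met.
    blob.UploadText(
      new JavaScriptSerializer().Serialize(obj),
      Encoding.Default, blobRequestOptions);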
  }
  else if (context is PessimisticConcurrencyContext)
  {
    ...
  }
}
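The ETag check that makes this work can be illustrated without a storage account. The following sketch is a hypothetical in-memory stand-in for a blob container; the InMemoryBlobContainer class and its Get, Put, and Save methods are assumptions made for this example and are not part of the storage client library or the Surveys application. It shows only the core rule: a conditional save succeeds when the caller's ETag still matches the stored one, and fails otherwise. Blob storage applies a comparable rule on the server when a request carries an If-Match condition.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a blob container, for illustration only.
public class InMemoryBlobContainer
{
    private readonly Dictionary<string, (string Content, string ETag)> blobs =
        new Dictionary<string, (string Content, string ETag)>();
    private int counter;

    // Returns the content and reports the ETag that was current at read time.
    public string Get(string id, out string etag)
    {
        var blob = blobs[id];
        etag = blob.ETag;
        return blob.Content;
    }

    // Unconditional write: stores the content under a fresh ETag.
    // Used here to seed data or to simulate a competing writer.
    public string Put(string id, string content)
    {
        string etag = "etag-" + (++counter);
        blobs[id] = (content, etag);
        return etag;
    }

    // Conditional write: succeeds only if the blob exists and the
    // caller's ETag still matches the stored ETag.
    public string Save(string id, string content, string expectedETag)
    {
        if (!blobs.TryGetValue(id, out var blob) || blob.ETag != expectedETag)
        {
            throw new InvalidOperationException(
                "Concurrency violation: ETag mismatch for " + id);
        }

        return Put(id, content);
    }
}
```

If a second writer calls Put between one caller's Get and Save, the stored ETag changes, so that caller's Save throws and the second writer's content is left intact.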

More Information

For more information about scalability and throttling limits, see the following:

For more information about building large-scale applications for Windows Azure, see “Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services” on MSDN.

For more information about the CDN, see “Content Delivery Network” on MSDN.

For information about an application that uses the CDN, see the post “EmailTheInternet.com: Sending and Receiving Email in Windows Azure” on Steve Marx’s blog.

For an episode of Cloud Cover that covers CDN, see “Cloud Cover Episode 4 - CDN” on Channel 9.

For a discussion of how to make your Windows Azure application scalable, see “Real World: Designing a Scalable Partitioning Strategy for Windows Azure Table Storage.”

For more information about autoscaling in Windows Azure, see “The Autoscaling Application Block.”

For a discussion of approaches to autoscaling in Windows Azure, see “Real World: Dynamically Scaling a Windows Azure Application.”

For a discussion of approaches to simulating load on a Windows Azure application, see “Real World: Simulating Load on a Windows Azure Application.”

For more information about the MapReduce algorithm, see the entry for “MapReduce” on Wikipedia.
