7: Making Tailspin Surveys More Resilient

Retired Content

This content and the technology it describes are outdated and no longer maintained. For more information, see Transient Fault Handling.



This chapter walks you through the changes that Tailspin made when it added the Transient Fault Handling Application Block to the Surveys application in order to improve the application's resilience to transient fault conditions in the Microsoft Azure environment.

Applications that run on Azure must be able to handle transient fault conditions gracefully and efficiently in order to reduce the potential impact of those conditions on the application's stability.

The Premise

In order to meet the requirements of its larger customers, Tailspin agreed to increase the service levels in its service-level agreements (SLAs), especially with regard to the reliability and availability of the Surveys application. These customers also have more stringent performance requirements, such as the maximum time that an export of survey data to SQL Azure should take. To meet these new SLA requirements, Tailspin closely re-examined the Surveys application to see where it could improve the application's resilience.

Beth Says:
Improving the reliability and resilience of Surveys is vital if Tailspin is going to succeed in attracting larger customers.

Tailspin discovered that when the Surveys application makes calls to SQL Azure or Azure Storage, transient conditions sometimes cause errors. The call succeeds if Tailspin retries the operation a short time later, when the transient condition has cleared.

The Tailspin Surveys application uses Azure storage and SQL Azure. Survey definitions are stored in Azure tables, customer configuration data is stored in Azure blob storage, and survey answers are also stored in Azure blob storage. The Surveys application also enables customers to export survey data to SQL Azure where customers can perform their own detailed analysis of the results. For some customers, the SQL Azure instance is located in a different data center from where the customer's surveys are hosted.

Operators have noticed occasional errors in the Surveys application log files that relate to storage errors. These errors are not related to specific areas of functionality, but appear to occur at random. There have been a small number of reports that users creating new surveys have lost the survey definition when they clicked the Save button in the user interface (UI).

There have also been occasions when long-running jobs that export data to SQL Azure have failed. Because there is no way to resume a partially completed export task, Tailspin must restart the export process from the beginning. Tailspin has rerun the jobs that did not complete successfully, but this has meant that Tailspin has failed to meet its SLA with the customer. Where the export goes to a different data center from the one that hosts the survey definitions, Tailspin has incurred additional bandwidth costs as a result of having to rerun the export job.

Goals and Requirements

Tailspin wants to implement automatic retry logic for all of its Azure storage operations to improve the overall reliability of the application. It wants to minimize the risk of losing survey data and creating inaccurate statistics. Tailspin wants to ensure that the application is as resilient as possible, so that it can recover from any transient errors without operator intervention. It also wants to minimize the chance of customers experiencing errors when they are creating new survey definitions.

Tailspin also wants to improve the reliability of the export tasks that send data to SQL Azure so that it can meet its SLAs with its customers.

Tailspin wants to be able to tune the retry policies (for example, by adjusting the back-off delay) for different scenarios. Some tasks are more time critical, such as saving a new survey definition while a user waits for an acknowledgement that the definition has been saved; other tasks, such as the statistics calculation, are less time critical because they are not designed to give real-time results.

Overview of the Transient Fault Handling Application Block Solution

The Transient Fault Handling Application Block enables you to add retry logic to your cloud-based application. You can use the application block to apply a retry policy to any calls that may experience errors as a result of transient conditions.

The Transient Fault Handling Application Block includes detection strategies that can identify exceptions that may be caused by transient faults. Tailspin Surveys uses Azure storage and SQL Azure; the Transient Fault Handling Application Block includes detection strategies for these services.

The Transient Fault Handling Application Block uses retry strategies to define retry patterns: the number of retries and the interval between them. These retry strategies can be defined in code or in configuration. Tailspin plans to use retry strategies defined in configuration so that it is easier to tune the behavior of the retry strategies used by the Surveys application.
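
Although Tailspin defines its retry strategies in configuration, a detection strategy and a retry strategy can also be combined directly in code. The following minimal sketch illustrates the idea; the saveSurveyDefinition delegate is a placeholder, and the exact namespaces depend on the version of the application block you install.

using System;
using Microsoft.Practices.TransientFaultHandling;

// A minimal sketch: combine a retry strategy (how often to retry) with a
// detection strategy (which exceptions count as transient) to build a policy.
// StorageTransientErrorDetectionStrategy ships in the application block's
// Azure storage integration assembly; the namespace varies by block version.
public static class RetryPolicyExample
{
    public static void Run(Action saveSurveyDefinition)
    {
        // Retry up to six times, waiting five seconds between attempts.
        var retryStrategy = new FixedInterval(6, TimeSpan.FromSeconds(5));

        // The detection strategy decides whether an exception thrown by the
        // storage client looks like a transient fault.
        var retryPolicy =
            new RetryPolicy<StorageTransientErrorDetectionStrategy>(retryStrategy);

        // Transient failures inside the delegate are retried automatically;
        // other exceptions propagate to the caller immediately.
        retryPolicy.ExecuteAction(saveSurveyDefinition);
    }
}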

Inside the Implementation

This section describes some of the details of how Tailspin uses the Transient Fault Handling Application Block and how it modified the Surveys application to use the application block. If you are not interested in the details, you can skip to the next section.

You may find it useful to have the Tailspin solution open in Visual Studio while you read this section so that you can refer to the code directly.

For instructions on installing the Tailspin Surveys application, see Appendix B, "Tailspin Surveys Installation Guide."

Tailspin uses the Transient Fault Handling Application Block in the Surveys application wherever it is using the Azure storage API or invoking an operation on a SQL Azure database. For example, it uses the application block in the code that accesses the rule and service information stores, in the wrapper classes for the Azure storage types, and in the SurveySqlStore class. All of these classes are located in the Tailspin.Shared project.

The configuration file for each worker and web role in the Surveys application includes the retry strategies shown in the following code snippet.

<RetryPolicyConfiguration 
    defaultRetryStrategy="Fixed Interval Retry Strategy" 
    defaultAzureStorageRetryStrategy="Fixed Interval Retry Strategy" 
    defaultSqlCommandRetryStrategy="Backoff Retry Strategy">
  <incremental name="Incremental Retry Strategy" retryIncrement="00:00:01" 
    initialInterval="00:00:01" maxRetryCount="10" />
  <fixedInterval name="Fixed Interval Retry Strategy" retryInterval="00:00:05" 
    maxRetryCount="6" firstFastRetry="true" />
  <exponentialBackoff name="Backoff Retry Strategy" minBackoff="00:00:05" 
    maxBackoff="00:00:45" deltaBackoff="00:00:04" maxRetryCount="10" />
</RetryPolicyConfiguration>

Tailspin uses the Enterprise Library configuration tool to edit these settings.

Markus Says:
Tailspin Surveys uses a limited number of retry strategies from a limited number of locations in code. The configuration defines default retry strategies, which makes the code easier to maintain.

Tailspin uses the RetryManager class to load the retry strategies from the configuration file and instantiate a retry policy. The following code snippet from the RuleSetModelStore class shows an example of how Tailspin creates a new retry policy that uses the Azure storage detection strategy and the "Incremental Retry Strategy" from the configuration.

public RuleSetModelStore(
    RuleSetModelToXmlElementConverter ruleSetModelToXmlElementConverter,
    [Dependency("RuleSetModel")] IConfigurationFileAccess fileAccess,
    RetryManager retryManager)
{
    this.retryPolicy = retryManager
        .GetRetryPolicy<StorageTransientErrorDetectionStrategy>(
            AzureConstants.FaultHandlingPolicies.Incremental);

    ...
}
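
Where constructor injection of a RetryManager is not convenient, the RetryPolicyFactory class can resolve a policy from configuration directly. The following is a minimal sketch; it assumes a GetRetryPolicy overload that accepts the retry strategy name, and the name matches the configuration shown earlier.

// A minimal sketch, assuming RetryPolicyFactory exposes an overload that
// resolves a named retry strategy from the configuration file.
this.retryPolicy = RetryPolicyFactory
    .GetRetryPolicy<StorageTransientErrorDetectionStrategy>(
        "Incremental Retry Strategy");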

Note

You should be careful when loading retry strategies from the web.config file by using the RetryPolicyFactory or RetryManager classes in the web role's OnStart event. See the topic "Specifying Retry Strategies in the Configuration" on MSDN for more details.

If you are using the Transient Fault Handling Application Block with Azure storage, you should disable the built-in retry policies in the Azure storage APIs so that the storage client's own retries do not compound the retries performed by the application block. The following code snippet from the AzureQueue class in the Tailspin.Shared project shows how Tailspin disables the built-in retry policies.

var client = this.account.CreateCloudQueueClient();
client.RetryPolicy = RetryPolicies.NoRetry();
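
The blob and table clients expose the same RetryPolicy property, so the same step applies wherever the application creates them. The following sketch is illustrative and assumes the same legacy Microsoft.WindowsAzure.StorageClient API that the queue example uses.

// Illustrative only: disable the storage client's own retries for blobs and
// tables as well, so that only the application block performs retries.
var blobClient = this.account.CreateCloudBlobClient();
blobClient.RetryPolicy = RetryPolicies.NoRetry();

var tableClient = this.account.CreateCloudTableClient();
tableClient.RetryPolicy = RetryPolicies.NoRetry();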

The following code snippet from the GetFileContent method in the RuleModelStore class shows how Tailspin wraps a call that accesses Azure storage that may be affected by transient fault conditions with the retry policy.

try
{
    return this.retryPolicy.ExecuteAction(
        () => this.fileAccess.GetFileContent());
}
catch (ConfigurationFileAccessException)
{
    return null;
}
Jana Says:
If Tailspin wanted to collect information about the retries in the application, it could use the Retrying event on the retry policy to capture the details and log them for analysis.
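
Such a subscription might look like the following sketch; the handler and the trace message are illustrative and are not part of the Tailspin code.

// A minimal sketch of logging retries by handling the Retrying event, which
// the application block raises before each retry attempt.
this.retryPolicy.Retrying += (sender, args) =>
    Trace.TraceWarning(
        "Retry {0} scheduled after {1}. Last exception: {2}",
        args.CurrentRetryCount,
        args.Delay,
        args.LastException.Message);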

Tailspin uses the same approach when the Surveys application saves data to SQL Azure, as shown in the following code sample from the SurveySqlStore class. This example also shows how to load a default policy from configuration.

public SurveySqlStore()
{
    this.retryPolicy = RetryPolicyFactory.GetDefaultSqlCommandRetryPolicy();
}

public void SaveSurvey(string connectionString, SurveyData surveyData)
{
    using (var dataContext = new SurveySqlDataContext(connectionString))
    {
        dataContext.SurveyDatas.InsertOnSubmit(surveyData);
        try
        {
            this.retryPolicy.ExecuteAction(() => dataContext.SubmitChanges());
        }
        catch (SqlException ex)
        {
            Trace.TraceError(ex.TraceInformation());
            throw;
        }
    }
}
Markus Says:
Tailspin Surveys uses LINQ to SQL as an object-relational mapper. All database interactions are abstracted by the data model; therefore, Tailspin does not have to use the ReliableSqlConnection class or the SQL Azure extension classes provided by the Transient Fault Handling Application Block.

Tailspin's data access requirements are relatively simple, so it only needs to use the simplest version of the ExecuteAction method. It does not need to wrap any calls that return values or make any asynchronous calls.
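
For reference, the overload that returns a value would look like the following sketch; Tailspin does not currently use it, and the query is illustrative only.

// A minimal sketch of the ExecuteAction<TResult> overload, which returns the
// value produced by the wrapped delegate.
int surveyCount = this.retryPolicy.ExecuteAction(
    () => dataContext.SurveyDatas.Count());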

Setup and Physical Deployment

The Tailspin Surveys application uses retry strategies defined in the configuration files for the roles that use Azure storage and SQL Azure. In the sample, all of these roles use the same retry strategies. In a real-world deployment you should adjust the retry strategies to meet the specific requirements of your application.
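
For example, a worker role that runs only the long-running SQL Azure export could tolerate a longer back-off than the web role that saves survey definitions. The following configuration fragment is illustrative only; it reuses the schema shown earlier with example values.

<!-- Illustrative only: a longer back-off for a role that runs the
     long-running SQL Azure export. -->
<RetryPolicyConfiguration
    defaultRetryStrategy="Fixed Interval Retry Strategy"
    defaultSqlCommandRetryStrategy="Export Backoff Retry Strategy">
  <fixedInterval name="Fixed Interval Retry Strategy" retryInterval="00:00:05"
    maxRetryCount="6" firstFastRetry="true" />
  <exponentialBackoff name="Export Backoff Retry Strategy" minBackoff="00:00:10"
    maxBackoff="00:02:00" deltaBackoff="00:00:10" maxRetryCount="10" />
</RetryPolicyConfiguration>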

More Information

For instructions on installing the Tailspin Surveys application, see Appendix B, "Tailspin Surveys Installation Guide" on MSDN:
https://msdn.microsoft.com/en-us/library/hh680894(v=PandP.50).aspx

For more information about retry strategies, see "Specifying Retry Strategies in the Configuration" on MSDN:
https://msdn.microsoft.com/en-us/library/hh680900(v=PandP.50).aspx

