January 2011

Volume 26 Number 01

Forecast: Cloudy - Branch-Node Synchronization with SQL Azure

By Joseph Fultz | January 2011

In my years prior to joining Microsoft and for the first few years thereafter, I was heavily involved in the retail industry. During this time, I found it almost humorous to see how many times the branch-node synchronization problem gets resolved as technologies advance.

In my current role, I’ve had a fair amount of exposure to the oil and gas (O&G) industry, and I’ve found that it has a similar problem of synchronizing data between nodes. Much like retail chains, O&G companies have a wide range of devices and connectivity challenges. From a latent satellite connection on an offshore oil platform to an engineer in an oil field, the requirement for timely, accurate data doesn’t change.

So, with both retail and O&G in mind, I’m taking a look at the challenge again, but this time with a little help from SQL Azure and the Sync Framework. I’ll discuss how the cloud will help solve the problem of moving data between the datacenter (corporate), the branches (for example, store, rig, hub and so on) and to individual devices (handheld, shared terminal, specific equipment and more).

This month, I’ll focus a little more on the general architecture and a little less on the implementation. I’ll still give a few code examples for setting up synchronization between nodes and SQL Azure and filtering content as a way to reduce traffic and time required for synchronization. In next month’s column, I’ll examine using a service-based synchronization approach to provide a scalable synchronization solution beyond splitting data across SQL Azure databases based on content or geographic distribution.

While the core problem hasn’t changed, what have changed are the additional requirements that get added to the mix as technology becomes more advanced. Instead of solving the simple problem of moving data between nodes, we start adding things we’d like to have, such as increasing the data volume to get more detail, inclusion of various devices for collecting and displaying data, and closer-to-real-time feeds.

Let’s face it, the more we have, the more we want. In most cases, solving the data flow problems from a decade ago would be simple, but in today’s world that solution would only represent the substrate of a more robust solution. For retail, the data flow can be pretty easy—taking the form of pushing catalog-type data (menu, warehouse and so on) down and t-logs back up—to quite complex, by frequently updating inventory levels, real-time loss-prevention analysis, and manual product entries from the branch to corporate and between branches. For the most part, O&G companies have the same patterns, but they have some added complexities related to the operation, evaluation and adjustment of equipment in use. I think about synchronization in the following ways to get a rough idea of the level of complexity (each one is subsequently more complex to implement and support):

  1. Push read-only data from corporate to branch and onward.
  2. Two one-way pushes on different data; one from corporate to branch (for example, catalog) and one from branch to corporate (for example, t-logs and inventory); this includes branch to branch—the focus is on the fact that it’s basically two or more one-way syncs.
  3. Bidirectional data synchronization between corporate and nodes (for example, manual product entry or employee information).
  4. Peer synchronization between branches and between branches and corporate.

Type 4 is by far the most complex problem and typically leads to many conflicts. Therefore, I try to avoid this pattern, and the only two criteria that would force it are the need for real-time updates between nodes or the ability to sync branches if the corporate data store isn’t accessible. Because near-real-time or real-time updates among too many nodes would generally create too much traffic and isn’t typically a reasonable solution, the only criterion to which I really pay attention is the ability to sync without the master. In some cases, real-time information is needed between nodes, but this isn’t generally the case for data synchronization. Rather, it’s an event notification scenario, and a different tack is taken to address the need.

Defining the Solution Architecture

Generally, the most prevalent pattern that I see is to push data directly from the corporate master database (via a distributor of some type) down to servers at the branches and to mobile users. The distribution to workstations, point-of-sale (POS) terminals and other such devices is typically done from the server at the branch location (commonly called “back-of-house servers”), whereas the synchronization to mobile users (for example, laptops) is done directly from corporate to the machine via a client-initiated synchronization process (see Figure 1).

image: Typical Architecture for Data Distribution

Figure 1 Typical Architecture for Data Distribution

Some organizations do this via the built-in replication features of the relational database management system (RDBMS), while others build processes to handle the distribution and collection of data. I’ll maintain the pattern, but use an instance of SQL Azure in place of the distributor; in place of replication, I’ll use the Sync Framework, which supports SQL Azure. Thus, I simply add a layer between the distributor and the branches (see Figure 2).

image: Base Architecture Using SQL Azure and the Sync Framework

Figure 2 Base Architecture Using SQL Azure and the Sync Framework

What do I get by inserting SQL Azure? Some of the benefits in a branch-node scenario are:

  1. Scaling the data service without having to grow the datacenter footprint.
  2. High availability of the data without additional cost and effort.
  3. Potentially less security sensitivity, because it isn’t the master data store.

Consider that in the first scenario, if the corporate connection or data store is down, all of the clients will have to hold on to their transactions. This could easily lead to lost data due to losing the device while waiting for the connection or due to simply running out of space on the device to store the transactions, as mentioned earlier. Additionally, if the branches have common data (for example, warehouse inventory data), they’ll be working off of old data until corporate is back up. While there’s no perfect solution, SQL Azure addresses this scenario by automatically making copies of the data and providing automatic failover. Also, by segmenting the flow of data through multiple SQL Azure databases, I can reduce the risk of being down and reduce load exposure further by using not only a separate instance but also different datacenters.

As a point of design, I must consider the impact of initiating synchronization from the branches or from the server. If the application is designed to sync from the master or distributor to the nodes, I enjoy the advantage of fewer points of management and support; on the downside, it puts some technical strains on the implementation, requiring:

  1. Knowing the endpoints.
  2. Knowing the appropriate scopes for each target.
  3. Complexity in the synchronization process in order to make the synchronization of multiple nodes happen in parallel; the API semantics are really one pair of endpoints and one scope at a time.

By initiating synchronization from the target (node or branch, for example) the complexity is diminished for the synchronization code, as it:

  • Can focus on the scope(s) for the application/device.
  • More easily handles being occasionally connected.
  • Only has to know and manage a few endpoints by which the distribution data is synchronized.

However, it will put a bit more complexity into the applications on the target device and could complicate support and maintenance by potentially having to debug the sync process or agent at each device. Ideally, if data must be synchronized for different applications, a separate process should be created that manages the synchronization based on a configuration file where scope, frequency and connection strings are defined for a sync agent to run. This sync agent would exist externally to the applications that are the data consumers on the devices, although the process would provide a means for a given application to initiate synchronization of its data. This gives the benefit of initiating synchronization from the node, but it also reduces the support and maintenance aspect, because it’s rolled up into a single process.

Using the Sync Framework, I tend to start with a mixed model of synchronization by initiating from the master data store to SQL Azure and subsequently initiating from the nodes to sync between the node and SQL Azure. Restated, one might say that data is pushed from master and pulled from the branches, with SQL Azure becoming the highly available central hub between the master and branches. Based on the needs and constraints of the solution being considered, I think about the costs and benefits of moving the synchronization process control from one point in the chain to another (for example, device to cloud or corporate to cloud). Just a few of these considerations are questions such as:

  • Is there a place to host the process for master? 
  • Are there security policies that conflict with hosting the sync process in SQL Azure?
  • How many nodes at each level are synchronizing?
  • Can the target device realistically support a sync process?
  • What’s the requirement with regard to timeliness of data sync?

What’s more, each of these questions has multiple layers that need to be considered and against which possible solution designs must be vetted. While there isn’t a one-for-all design, I like to start with the model described earlier and either synchronize multiple one-way syncs to accomplish something similar to bidirectional data sync, or use bidirectional synchronization between the device/corporate database and SQL Azure. After which, I look for scenarios that invalidate and force a modification to the design. Generally, the only synchronization style I attempt to avoid is peer-to-peer.

Setting up Synchronization

There are two methods to set up synchronization using Sync Framework 2.1: sync client in the cloud and sync client on the local machine. I’ll focus on the latter for the moment. At its simplest, here are the steps to set up a synchronization relationship:

  1. Identify the data to be synchronized and the direction of data flow. This will be used to define the scopes (SqlSyncScopeProvisioning) used to synchronize the data.
  2. Download and install the Sync Framework 2.1 (www.microsoft.com/download/details.aspx?id=23217). Note: If x64 is the target platform, a build target for x64 will need to be added or the SyncOrchestrator won’t be able to resolve its dependencies.
  3. Provision the databases and tables for synchronization; the entire database could be provisioned, or only specific tables, or it can be limited to given columns.
  4. Add necessary filters. In the case that it’s desirable to horizontally partition or otherwise filter the data, filters may be used.
  5. Create and run the process to synchronize.

I’ll be rather specific in this example, as it helps convey the message, and I’ll start with databases in place at both ends. I create a connection to the local database and retrieve a definition (DbSyncTableDescription) for the table to be synchronized and add that table to the scope (DbSyncScopeDescription). In addition, I’ll specify the particular columns, but this isn’t necessary if the desire is to simply synchronize the entire table. Limiting the sync relationship to specific columns is a good way to optimize bandwidth usage and speed up processes (see Figure 3).

Figure 3 Creating a Synchronization Scope

SqlConnection azureConn = new SqlConnection(AzureConnectionString);
SqlConnection onPremiseConn = new SqlConnection(LocalConnectionString);
// List of columns to include
Collection<string> columnsToInclude = new Collection<string>();
// Definition for authors from local DB
DbSyncTableDescription authorsDescription =
  columnsToInclude, onPremiseConn);
// Create a scope and add tables to it
DbSyncScopeDescription authorScopeDesc = new DbSyncScopeDescription(ScopeName);
// Add the authors table to the sync scope

For each structure that needs to be synchronized, a bit of code will need to be added to get the description; you must subsequently add it to the scope. The next step is to grab a scope-provisioning object and use it to provision each database if the scope doesn’t already exist in that database, as shown in Figure 4.

Figure 4 Provisioning Scope

// Create a provisioning object for "customers" and 
// apply it to the on-premises database
SqlSyncScopeProvisioning onPremScopeConfig = 
  new SqlSyncScopeProvisioning(onPremiseConn, authorsScopeDesc);
if (!(onPremScopeConfig.ScopeExists(authorsScopeDesc.ScopeName)))
// Provision the SQL Azure database from the on-premises SQL Server database
SqlSyncScopeProvisioning azureScopeConfig = 
  new SqlSyncScopeProvisioning(azureConn, authorsScopeDesc);
if (!(azureScopeConfig.ScopeExists(authorsScopeDesc.ScopeName)))

Because this is the first time that the scope has been provisioned in the database, there will be some new tables for storing scope information and also a table specifically for tracking the Authors scope that was provisioned in the databases. A good example of a console app to provision or sync a local and SQL Azure database can be found on the Sync Framework Team Blog at bit.ly/dCt6T0.

SQL Azure Data Sync is a cloud-based service hosted in Azure that enables synchronization of entire databases or specific tables between SQL Server and SQL Azure. At the Microsoft Professional Developers Conference 2010, we announced an update to this service called SQL Azure Data Sync Community Technology Preview (CTP) 2. This update provides organizations the ability to easily extend on-premises SQL Server databases to the cloud, allowing for phased migrations of applications to the cloud. Solutions leveraging SQL Azure Data Sync will allow users to continue to access local data and have changes seamlessly synchronized to SQL Azure as they occur. Similarly, any changes made by applications to SQL Azure are synchronized back to the on-premises SQL Server.

Keeping Data in Sync

SQL Azure Data Sync provides a central cloud-based management system for all synchronization relationships. From any browser, administrators can connect to the public service and manage and monitor the various database endpoints. In addition, SQL Azure Data Sync provides a scheduling service that allows synchronization to take place as often as every five minutes, or less frequently if the preference is to run synchronization at off-peak times.

In the recent SQL Azure Data Sync CTP 2 update, we also introduced a new component called the SQL Azure Data Sync Agent. This agent is a Windows Service that’s installed on-premises and links the local SQL Server databases to SQL Azure Data Sync through a secure outbound HTTPS connection, meaning there are no requirements from a firewall or security-configuration standpoint—which makes setup a snap. The Agent’s job is to monitor and log tasks as well as to initiate synchronization requests from the SQL Azure Data Sync.

New Scenarios

With SQL Azure Data Sync, synchronization between SQL Server and SQL Azure databases provides for a wealth of new scenarios that, in the past, were quite difficult to build. Imagine you wanted to share data with your branch offices or retail store databases. With SQL Azure Data Sync, this is easy because administrators create “Sync Groups” that define data to be shared among databases. These Sync Groups could contain a corporate SQL Server that synchronizes data to a centralized SQL Azure “Data Hub.” Then, from this Data Hub, all of the remote or regional SQL Server databases can synchronize data changes, enabling them to bring data closer to the users, while also greatly reducing bandwidth and requirements for Virtual Private Networks, or VPNs.

In addition, the ability to synchronize across multiple SQL Azure datacenters makes it easy to scale out loads across geographies. Imagine you have quarterly reporting requirements that put a huge cyclical load on your SQL Server database. Why not synchronize some of that data to your SQL Azure databases around the world when needed? Then users could access the data closest to them while reducing scalability requirements on your local SQL Server.

—Liam Cavanagh, Senior Program Manager, SQL Azure Data Sync

Synchronizing the Data

Once the databases are properly provisioned, it’s pretty simple to get them synchronizing. It requires the creation of a SqlSyncProvider for each end of the activity with the scope specified. This involves the use of the SyncOrchestrator object, which is the magic behind the curtains that identifies the changes and moves the changes between them. That code looks something like this:

SqlConnection LocalConnection = new SqlConnection(LocalConnectionString);
SqlConnection AzureConnection = new SqlConnection(AzureConnectionString);
SqlSyncProvider LocalProvider = new SqlSyncProvider(ScopeName, LocalConnection);
SqlSyncProvider AzureProvider = new SqlSyncProvider(ScopeName, AzureConnection);
SyncOrchestrator orch= new SynOrchestrator();
orch.LocalProvider = new SqlSyncProvider(ScopeName, LocalConnection);
orch.RemoteProvider = new SqlSyncProvder(ScopeName, AzureConnection);
orch.Direction = SyncDirectionOrder.DownloadAndUpload;

Data and Geographic Dispersion

With simple replication of data handled, I can focus on optimizing the deployment architecture and data flow. Using the Sync Framework, I can specify filters; this in combination with SQL Azure can really be a huge benefit in branch-node architectures. Using the combination of the two, I can put the data closer to the ultimate consumer and optimize the bandwidth usage (and hence charges) by only synchronizing the data that matters for that region or data segmentation. Instead of using data servers in various geographical areas, data can simply be synchronized to an instance of SQL Azure in that geographical area, and those clients in that area can synchronize to it.

By spreading the data around geographically and implementing scopes that make sense for synchronizing particular data with a particular frequency, one can achieve fine-grained control over what, how, when and how much data flows across the wire, improving the user experience as it relates to data availability and freshness. Additionally, for end users who might travel between locations where it would be nice to be location-aware, the sync agent could locate itself and reach out to sync data specifically for the current location. A couple of examples of this are current stats or alerts for workers walking into a manufacturing/plant environment and current-day sales for regional managers of retail chains (see Figure 5).

image: Synchronizing Data with Filters

Figure 5 Synchronizing Data with Filters

Enabling filtering is no harder than provisioning a synchronization scope. Thus, there could be multiple scopes in existence that have different filters—or have none. The needed change is simply to add two lines of code for each filter that’s being added: one line to add a filter column to the table and a second to add the filter clause, which is basically a “where” condition. For my sample, I’m adding a filter based on state and synchronizing only changes for the state of Utah, or UT, like so:

onPremScopeConfig.Tables["authors"].FilterClause = "[authors].[state] = 'UT'";

If I want it to synchronize in both directions based on the filter, it will need to be added to both scopes as they’re provisioned on each end.

Go Forth and Spread the Data

Adding SQL Azure to the mix, whether a single instance or multiple ones, can really enhance the data availability and overall performance when synchronizing to nodes by adding that ever-important layer of indirection. Because it’s SQL Azure, one gets the performance, scalability and reliability without all of the headaches of designing, provisioning and managing the infrastructure. Look for next month’s column, where I’ll expand on the implementation and show how Azure can be added into the mix for synchronization using the latest Sync Framework 4.0 CTP released in October (bit.ly/dpyMP8).

Joseph Fultz is an architect at the Microsoft Technology Center in Dallas, where he works with both enterprise customers and ISVs designing and prototyping software solutions to meet business and market demands. He’s spoken at events such as Tech·Ed and similar internal training events.

Thanks to the following technical expert for reviewing this article: David Browne