Selecting an Appropriate Primary Key for a Distributed Environment

Article
07/07/2014

For tables that participate in incremental synchronization, Sync Framework requires that each row be uniquely identified. This is not required for client and server snapshot synchronization. Typically, rows are identified by a primary key that is defined in the server or peer database. In distributed environments, you must be particularly careful when you select the type of column to use as a primary key. Primary keys must be unique across all nodes, and must not be reused: If a row is deleted, the primary key of that row should not be used for another row. If a primary key is used by more than one node, a primary key collision can occur. This occurs with any kind of distributed environment and is not a limitation of Sync Framework. This topic describes several choices that you can make for primary keys, and describes their appropriateness for distributed environments.

Auto-Increment (Identity) Columns

Database architects often select an auto-increment column to serve as the primary key. This auto-increment property (the IDENTITY property in SQL Server) generates a new value for each record that is inserted into a table. This new value is generated by increasing or decreasing the current value (the seed) by a fixed sum (the increment) and assigning the result to the row that is being inserted. Auto-increment columns typically use compact data types, such as integers. These might result in a more compact clustered index, more efficient joins, and less IO when the underlying table is queried.

However, because the seed and increment properties are fixed and can be selected from a finite number of possible values, the probability of a primary key collision is very high. This type of key is suited for download-only data-caching scenarios. In these scenarios, the server or a designated peer should be the only node that is generating new primary key values. Therefore, these values are guaranteed to be unique across all the nodes in the topology. Auto-increment columns are also suited for upload-only and bidirectional scenarios if insert operations occur at only one node. In these scenarios, insert operations typically occur only at the server or a designated peer; and update operations, and possibly delete operations, occur at one or more clients. If you require insert operations at more than one node, you should use one of the other approaches described later in this topic.

Be aware that, when a table in a SQL Server Compact database is initialized through synchronization, the seed and increment of its identity columns are always set to 0 and 1, respectively, regardless of how they are set on the server database. If you need to insert rows locally in the client database you must first set the identity seed and increment to appropriate values or manually specify the identity column value, otherwise new rows will have duplicate identity values and row insertion will fail. The identity seed and increment can be set with the following query, which sets the seed value to 10 and the increment to 1:

ALTER TABLE ExampleTable ALTER COLUMN idCol IDENTITY(10,1)

GUIDs

Using a GUID (a uniqueidentifier column in SQL Server) as the primary key guarantees uniqueness across any number of nodes and eliminates the primary key collisions possible with auto-increment columns. However, using a GUID in the primary key has the following consequences:

The large data type (16 bytes) increases the size of the clustered index, which can adversely affect common operations, such as joins.
The unordered generation of GUIDs causes rows to be inserted into random locations in the clustered index. This in turn can cause a fragmented clustered index. This can adversely affect the IO required to query the underlying table.

In SQL Server 2005 and later versions, you can use the NEWSEQUENTIALID() function generate GUIDs in a sequential order to help eliminate this fragmentation.

Keys That Include a Node Identifier

In this approach, you use a key that combines a value that is unique at the server or client node with a value that is unique across the topology. For example, in client and server synchronization you could use an auto-increment column (unique at the node) combined with a column that stores a hash of the ID that Sync Framework assigns to each client. (This is the ClientId that is unique across the topology.) You would then create a composite primary key that has these two columns. As an alternative, you could develop a system to generate values for each inserted row so that you could include the row ID and the client ID in one column.

Natural Keys

With this strategy, you do not use any kind of manufactured key, but instead use a business key to uniquely identify records. For example, a table that is used to store customer information might use the social security number column as the primary key instead of an identity column. The drawback to this approach is that the primary key might become large if more than one column is required to uniquely identify a record. Furthermore, this compound key must be propagated to other tables to support one or more foreign key relationships. These relationships, in turn, adversely affect join performance.

Online Insertion

If none of the previous solutions are appropriate and your scenario requires just a few insert operations at a client, it might be feasible for the application to directly insert these rows at the server. The new rows will then be downloaded and inserted at the client during the next synchronization. Because the primary key values are generated at the server, primary key collisions will not occur.

Selecting an Appropriate Primary Key for a Distributed Environment

Auto-Increment (Identity) Columns

GUIDs

Keys That Include a Node Identifier

Natural Keys

Online Insertion

See Also

Concepts

Additional resources