Stay Away From Auto Increment Columns

I had few conversations on the sync services forum about that ability to synchronize a table with auto-increment column, also known as identity column, as the primary key or unique id for the row. It is rather common to see this pattern of primary key set to auto increment integer type in every table and then establish foreign key relationships across different tables in the database. Unfortunately, auto-increment columns are not designed with sync in mind.  

Unique ID

Before we go any further, let me state this clearly that sync services must be able to uniquely identity each row undergoing sync. By definition, the unique key must not be reintroduced in the system by any other node.

You might get the feeling that this requirement is too restrictive and it would hinder many scenarios. Well, you will be surprised to know that this is such a basic requirement that all sync and replication systems are built upon! For example, Merge replication adds a GUID column to every table it synchronizes, you cannot delete that column (try it); file replication service in windows uses FID that NTFS generates to uniquely identify each file; active directory replication requires a GUID for every object …etc

Data Caching Scenarios

For download only or caching scenarios where the data is cached on the client for easy reference, there is no problem in synchronizing auto increment columns. This is because the server is the only node generating row IDs which is guaranteed to be unique across the system. Clients are not changing the data thus eliminating any ID collisions.

Bidirectional Sync Scenarios

When you have many clients generating same ID and uploading data to the server, the problem manifests itself clearly. Many clients will end up generating a row with ID of 17, for example. These rows are actually different objects yet have the same identity.. This scenario does not work. I know some people tried few tricks to get this scenario to work, which I acknowledge is possible for simple cases, but once you add foreign key relationships to the mix the complexity will shot high up, take my word for it.

Possible Solutions

  • Use Unique Keys
    The best solution and will save you a lot of effort on the long run, is to eliminate the risk of reusing the same key elsewhere in the system by generating a globally unique key for every new row. Few ideas of a unique key generation:
    • Use Guid 
    • Use a hash of machine unique MAC address  + local incrementing number (i.e. timestamp, or application controlled number)
  • Online Insertions
    If the row inserts goes to the server directly then the server will be the only node that generates IDs which guarantees uniqueness. That said, if you must use auto-increment columns and your clients does not make inserts very often, then you might want to consider making inserts directly to the server database followed by a sync to get the rows back to the client database. Updates and deletes will remain offline.
  • ID Ranges
    This is the same technique used by merge replication today. It defines a unique range of IDs for each client and thus avoids any overlaps between clients. Sync Services allows you to do that only once when the schema is created on the client. Unlike merge, clients cannot go back to the server to request a new range. This might be addressed in future versions of the sync services based on the feedback we receive from developers.

This is a important problem for us and we are looking into addressing this in future versions with new and innovative techniques but for now try to avoid auto increment columns for your offline applications.

 

Update: Just to let you know, I left Microsoft to start a new company, Raveable Hotel Reviews . See examples: Top 10 Hotels in Myrtle Beach, Best Hotels in New York City, Romantic Hotels in Seattle , Hotels with a kitchen or kitchenette and Top 10 Hotels in San Francisco . Your feedback is welcome on twitter.com/raveable, raveable blog.