Achieving no downtime through versioned service updates

Historically, on-premises software typically required administrators to take a server offline for updates and upgrades. However, downtime is a complete nonstarter for global 24×7 services. Many modern cloud services are a critical dependency for the businesses that rely on them to operate. There's never a good time to take a system down, so how can a team provide continuous service across challenging updates?

It's a given that these critical services need to be updated online. In other words, they need to provide seamless transitions from one version to another while customers are actively using them. This isn't always hard. Updating front-end layouts or styles is easy. Changes to features can be tricky, but there are well-known practices to mitigate migration risks. However, changes that emanate from the data tier introduce a new class of challenges that require special consideration.

Update layers separately

With a distributed online service running in multiple datacenters with separate data storage, not everything can change simultaneously. A typical service splits into application code and databases, each versioned independently of the other, so one of those layers needs to absorb the complexity of handling versioning.

More often than not, it's easier to handle it in the application code. Larger systems usually have quite a bit of legacy code, such as SQL that lives inside their databases. So instead of further complicating the SQL, handle the complexity in application code. Specifically, create a set of factory classes that understand the SQL versioning.
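As a rough sketch of that factory pattern, the example below (in Python, with a hypothetical `WorkItemStore` binding and invented column names) keeps one binding per supported schema version and lets a single factory function pick between them:

```python
from abc import ABC, abstractmethod


class WorkItemStore(ABC):
    """One binding per supported schema version (hypothetical example)."""

    @abstractmethod
    def get_title(self, row: dict) -> str: ...


class Sprint122Store(WorkItemStore):
    # Older schema: the title lives in a single 'Title' column.
    def get_title(self, row: dict) -> str:
        return row["Title"]


class Sprint123Store(WorkItemStore):
    # Newer schema: the title was split into two columns.
    def get_title(self, row: dict) -> str:
        return f"{row['TitlePrefix']}{row['TitleText']}"


# The factory maps each known schema version to its binding.
STORE_BINDINGS = {
    122: Sprint122Store,
    123: Sprint123Store,
}


def create_store(schema_version: int) -> WorkItemStore:
    """Return the binding that matches the database's schema version."""
    try:
        return STORE_BINDINGS[schema_version]()
    except KeyError:
        raise RuntimeError(f"No binding for schema version {schema_version}")
```

Each binding hides the differences in the underlying schema, so the rest of the application code is written against the interface rather than against a particular sprint's tables.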

Every sprint, create a new interface matching that sprint's schema version, so there's always code that matches each database version. This allows for easy rollback of binaries during deployment. The first step is to deploy the new binaries. If something goes wrong, roll them back to revert to the previous code. If the binary deployment succeeds, then start the database servicing.

So how does this actually work? Assume the team is currently deploying Sprint 123. The binaries understand the Sprint 123 database schema, and they also understand the Sprint 122 schema. The general pattern is to work with both version N and version N-1 of the SQL schema. The binaries query the database, determine which schema version they are talking to, and then load the appropriate binding. The application code also handles the case when the new data schema is not yet available. Once the new version is available, the application code can start making use of the new functionality that the latest database version enables.
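The runtime check might look something like the following sketch, which assumes a hypothetical `schema_version` bookkeeping table maintained by the migration scripts:

```python
import sqlite3

CURRENT_VERSION = 123   # sprint N: the schema these binaries were built for
PREVIOUS_VERSION = 122  # sprint N-1: still supported during the rollout

# Placeholder for the versioned bindings from the factory sketch above.
BINDINGS = {122: "sprint-122 binding", 123: "sprint-123 binding"}


def detect_schema_version(conn: sqlite3.Connection) -> int:
    # Hypothetical bookkeeping table maintained by the migration scripts.
    (version,) = conn.execute(
        "SELECT MAX(version) FROM schema_version"
    ).fetchone()
    return version


def bind_to_database(conn: sqlite3.Connection) -> str:
    version = detect_schema_version(conn)
    if version == CURRENT_VERSION:
        return BINDINGS[version]  # new schema live: new features may be enabled
    if version == PREVIOUS_VERSION:
        return BINDINGS[version]  # not yet serviced: stay on the old code path
    raise RuntimeError(f"Binaries don't understand schema version {version}")


# Demo: the binaries are deployed before the database is serviced.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schema_version (version INTEGER)")
conn.execute("INSERT INTO schema_version VALUES (122)")
print(bind_to_database(conn))  # -> sprint-122 binding
```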

Roll forward only with the data tier

Once the databases are upgraded, the service is in a roll-forward-only situation if a problem occurs. Online database migrations are complex and often multi-step, so rolling forward is usually the best way to address a problem. In other words, if the upgrade fails, it's likely a rollback would fail as well. There's little value in investing the effort to build and test rollback code that the team never expects to use.

Deployment sequence

Consider a scenario where the team needs to add a set of columns to a database and transform some data. This needs to be invisible to users, which means avoiding table locks as much as possible and, where a lock is unavoidable, holding it for the shortest time possible so that it isn't perceptible.

The first step is to manipulate the data, possibly in parallel tables, using a SQL trigger to keep the two copies in sync. This also means that large data migrations and transformations sometimes have to be broken into multiple steps spanning several deployments across multiple sprints.
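As an illustration of the parallel-table technique (using SQLite syntax and invented table names, not the team's actual schema), a trigger mirrors writes against the old table into the new one while existing rows are backfilled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Existing table the application still reads and writes.
    CREATE TABLE work_items (id INTEGER PRIMARY KEY, title TEXT);
    INSERT INTO work_items (title) VALUES ('Fix login bug');

    -- New parallel table with the reworked columns.
    CREATE TABLE work_items_v2 (
        id INTEGER PRIMARY KEY,
        title_prefix TEXT,
        title_text TEXT
    );

    -- Backfill existing rows (a single statement here; a real
    -- migration would batch this to keep locks short).
    INSERT INTO work_items_v2 (id, title_prefix, title_text)
        SELECT id, '', title FROM work_items;

    -- Trigger mirrors new writes into the parallel table while the
    -- application keeps using the old one.
    CREATE TRIGGER sync_work_items AFTER INSERT ON work_items
    BEGIN
        INSERT INTO work_items_v2 (id, title_prefix, title_text)
        VALUES (NEW.id, '', NEW.title);
    END;
""")

# A live write lands in both tables.
conn.execute("INSERT INTO work_items (title) VALUES ('Add dark mode')")
print(conn.execute("SELECT * FROM work_items_v2").fetchall())
# -> [(1, '', 'Fix login bug'), (2, '', 'Add dark mode')]
```

Because the trigger keeps the copies consistent, the backfill can proceed in the background without the application ever seeing a half-migrated view of the data.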

Once the extra data or new schema has been created in parallel, the team puts the application code into deployment mode. In deployment mode, when the code makes a call to the database, it first takes a lock on the schema and releases it only after the stored procedure has run. This keeps the schema from changing between the time the database call is issued and the time the stored procedure executes.

The upgrade code acts as a schema writer and requests a writer lock on the schema. The application code takes priority in acquiring reader locks, while the upgrade code sits in the background trying to acquire the writer lock. Under the writer lock, only a small number of very fast operations are allowed on the tables. Then the lock is released, and the application records that the new version of the database is in use and switches to the interfaces that match the new database version.
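A real service would typically take these locks in the database itself; the in-process sketch below is only meant to illustrate the reader/writer protocol just described:

```python
import threading


class SchemaLock:
    """Sketch of the locking protocol; a production service would
    coordinate through the database rather than an in-process lock."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self) -> None:
        # Application calls take reader locks; they only wait while
        # the writer actually holds the lock, so readers have priority.
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self) -> None:
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self) -> None:
        # The upgrade code waits in the background until no in-flight
        # database call remains, then gets a brief exclusive window.
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self) -> None:
        with self._cond:
            self._writer = False
            self._cond.notify_all()


lock = SchemaLock()


def call_database():
    lock.acquire_read()
    try:
        pass  # issue the call and run the stored procedure here
    finally:
        lock.release_read()


def upgrade_schema():
    lock.acquire_write()
    try:
        pass  # only a small number of very fast operations allowed
    finally:
        lock.release_write()
```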

The database upgrades are all performed using a migration pattern: a set of code and scripts looks at the version of the database and then makes the incremental changes to migrate the schema from the old version to the new one. All migrations are automated and rolled out via a release management service.
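A minimal version of such a migration runner (sketched here with SQLite and invented scripts, not the actual release tooling) just walks the database from its current version to the target, one increment at a time:

```python
import sqlite3

# Each entry upgrades the schema by exactly one version.
# The scripts are invented for illustration.
MIGRATIONS = {
    122: "ALTER TABLE work_items ADD COLUMN area_path TEXT",
    123: "ALTER TABLE work_items ADD COLUMN iteration_path TEXT",
}


def migrate(conn: sqlite3.Connection, target: int) -> None:
    """Walk the database from its current version up to `target`."""
    (current,) = conn.execute(
        "SELECT MAX(version) FROM schema_version"
    ).fetchone()
    for version in range(current + 1, target + 1):
        conn.execute(MIGRATIONS[version])  # incremental schema change
        conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
        conn.commit()                      # record each step as it lands


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schema_version (version INTEGER);
    INSERT INTO schema_version VALUES (121);
    CREATE TABLE work_items (id INTEGER PRIMARY KEY, title TEXT);
""")
migrate(conn, 123)
print(conn.execute("SELECT MAX(version) FROM schema_version").fetchone())  # (123,)
```

Because every step is incremental and recorded, the same runner works whether a database is one version behind or several, which is what makes the multi-sprint migrations described above manageable.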

It's also necessary to update the web UI without disrupting users. When upgrading JavaScript files, style sheets, or images, it's necessary to avoid having the client load a mix of old and new versions. That can lead to errors that lose work in progress, such as a field a user is editing. Therefore, version all JavaScript, CSS, and image files by placing all files associated with a deployment in a separate, versioned folder. When the web UI makes calls back to the application tier, it loads assets with a specified version. Only when a user action results in a full page refresh does the new web UI get loaded into the browser. This ensures that the user's experience is never disrupted by the upgrade.
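The folder convention might be as simple as the following sketch (the version stamp and URL layout are invented for illustration):

```python
# Each deployment ships its static assets under a folder named for
# that deployment's version, so old and new clients never mix files.
DEPLOYMENT_VERSION = "123.0"  # hypothetical version stamp


def asset_url(path: str, version: str = DEPLOYMENT_VERSION) -> str:
    """Build a URL that pins the asset to one deployment's folder."""
    return f"/static/{version}/{path.lstrip('/')}"


# A page rendered by the old binaries keeps requesting old assets;
# only a full page refresh picks up the new version's folder.
print(asset_url("js/app.js"))     # /static/123.0/js/app.js
print(asset_url("css/site.css"))  # /static/123.0/css/site.css
```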