Achieving no downtime through versioned service updates
Historically, on-premises software typically required administrators to take a server offline for updates and upgrades. However, downtime is a nonstarter for global 24×7 services. Many modern cloud services are a critical dependency for their users, who count on them to run their businesses. There's never a good time to take a system down, so how can a team provide continuous service across challenging updates?
It's a given that these critical services need to be updated online. In other words, they need to provide seamless transitions from one version to another while customers are actively using them. This isn't always hard. Updating front-end layouts or styles is easy. Changes to features can be tricky, but there are well-known practices to mitigate migration risks. However, changes that originate in the data tier introduce a class of challenges that requires special consideration.
Update layers separately
With a distributed online service running in multiple datacenters with separate data storage, not everything can change simultaneously. If a typical service is split into application code and databases, which are versioned independently of each other, one of the two needs to absorb the complexity of handling versioning.
More often than not, it's easier to handle that complexity in the application code. Larger systems usually carry quite a bit of legacy code, such as SQL that lives inside their databases. Rather than further complicating that SQL, handle the versioning in application code. Specifically, create a set of factory classes that understand the SQL versioning.
Every sprint, create a new interface for that version, so there's always code that matches each database version. This allows for easy rollback of binaries during deployment. Deploy the new binaries first. If something goes wrong, roll them back to revert to the previous code. If the binary deployment succeeds, start the database servicing.
So how does this actually work? Assume the team is currently deploying Sprint 123. The binaries understand the Sprint 123 database schema, and they also understand the Sprint 122 schema. The general pattern is to work with both versions N and N-1 of the SQL schema. The binaries query the database, determine which schema version they are talking to, and then load the appropriate binding. The application code handles the case where the new data schema is not yet available. Once the new version is available, the application code can start making use of the new functionality enabled by the latest database version.
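The factory pattern described above can be sketched as follows. This is a minimal illustration, not the actual service's code: the class names, the `WorkItems` table, and the `AssignedTo` column are all invented for the example. The key idea is that a binding exists for both schema versions N and N-1, and a factory selects one based on the version the database reports.

```python
class Sprint122Binding:
    """Data access code that understands the Sprint 122 (N-1) schema."""
    WORK_ITEM_SQL = "SELECT Id, Title FROM WorkItems WHERE Id = ?"

class Sprint123Binding:
    """Data access code that understands the Sprint 123 (N) schema,
    which (hypothetically) adds an AssignedTo column."""
    WORK_ITEM_SQL = "SELECT Id, Title, AssignedTo FROM WorkItems WHERE Id = ?"

# One binding per supported schema version: the current sprint and the previous one.
BINDINGS = {122: Sprint122Binding, 123: Sprint123Binding}

def create_binding(schema_version):
    """Factory: return the binding that matches the schema version the
    database reported when the application queried it."""
    try:
        return BINDINGS[schema_version]()
    except KeyError:
        raise RuntimeError(f"No binding for schema version {schema_version}")
```

Because both bindings ship in the same binaries, the deployment can roll the binaries out (or back) independently of the database: the factory simply keeps returning the N-1 binding until the database upgrade completes.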
Roll forward only with the data tier
Once the databases are upgraded, the service is in a roll-forward-only situation if a problem occurs. Online database migrations are complex and often multi-step, so rolling forward is usually the best way to address a problem. In other words, if the upgrade fails, it's likely that a rollback would fail as well. There's little value in building and testing rollback code that the team never expects to use.
Consider a scenario where the team needs to add a set of columns to a database and transform some data. This needs to be done invisibly to the user, which means avoiding table locks where possible and holding unavoidable locks for the shortest time possible, so that they aren't perceptible.
The first step is to manipulate the data, possibly in parallel tables, using a SQL trigger to keep the data in sync. Large data migrations and transformations therefore sometimes have to be split into multiple steps over several deployments across multiple sprints.
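The parallel-table step can be illustrated with a small, runnable example. This is a simplification using SQLite rather than the production SQL engine, and the `WorkItems` tables and the uppercase "transformation" are invented stand-ins: a trigger mirrors every write to the old table into a parallel table that already has the new shape, so the background backfill never falls behind live traffic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The live table that the current application version writes to.
    CREATE TABLE WorkItems (Id INTEGER PRIMARY KEY, Title TEXT);

    -- Parallel table with the new schema, including the transformed column.
    CREATE TABLE WorkItems_New (Id INTEGER PRIMARY KEY, Title TEXT, TitleUpper TEXT);

    -- Trigger: each insert into the old table is copied (and transformed)
    -- into the parallel table, keeping the two in sync during the migration.
    CREATE TRIGGER SyncWorkItems AFTER INSERT ON WorkItems
    BEGIN
        INSERT INTO WorkItems_New (Id, Title, TitleUpper)
        VALUES (NEW.Id, NEW.Title, UPPER(NEW.Title));
    END;
""")

# While the trigger handles new writes, existing rows would be backfilled in
# small batches in the background; here a single live write shows the sync.
conn.execute("INSERT INTO WorkItems (Title) VALUES ('fix login bug')")
rows = conn.execute("SELECT Title, TitleUpper FROM WorkItems_New").fetchall()
```

Once the parallel table has caught up, a later deployment can switch reads over to it, which is why such migrations often span several sprints.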
Once the extra data or new schema has been created in parallel, the team puts the application code into deployment mode. In deployment mode, when the code makes a call to the database, it first takes a lock on the schema and releases it only after running the stored procedure. That keeps the database from changing between the time the call is issued and the time the stored procedure runs.
The upgrade code acts as a schema writer and requests a writer lock on the schema. The application code takes priority with its reader locks, while the upgrade code waits in the background to acquire the writer lock. Under the writer lock, only a small number of very fast operations are allowed on the tables. Then the lock is released, and the application records that the new version of the database is in use and switches to the interfaces that match the new database version.
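The reader/writer coordination above can be sketched with in-process locks. This is an illustration of the pattern, not the SQL locking primitives the service actually uses: application calls hold a shared reader lock for the duration of each call, while the upgrade waits for in-flight calls to drain, does its fast work under an exclusive writer lock, and then publishes the new version. The version numbers and function names are invented.

```python
import threading

class SchemaLock:
    """A simple readers-writer lock: many concurrent readers, one writer."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:            # wait while the upgrade holds the lock
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()          # wait for in-flight calls to drain
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

schema_lock = SchemaLock()
schema_version = 122

def run_stored_procedure():
    # Application path: hold the reader lock across the whole call, so the
    # schema cannot change between issuing the call and running the procedure.
    schema_lock.acquire_read()
    try:
        return schema_version              # stand-in for the actual call
    finally:
        schema_lock.release_read()

def upgrade_schema():
    # Upgrade path: under the writer lock, perform only a small number of
    # very fast operations, then record that the new version is in use.
    global schema_version
    schema_lock.acquire_write()
    try:
        schema_version = 123
    finally:
        schema_lock.release_write()
```

Because the writer lock is held only for the brief version switch, application calls see at most a momentary wait rather than perceptible downtime.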
The database upgrades are all performed using a migration pattern: a set of code and scripts examines the version of the database and then makes incremental changes to migrate the schema from the old version to the new one. All migrations are automated and rolled out via a release management service.
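A minimal sketch of that migration pattern, again using SQLite for illustration: the migration scripts, the `SchemaVersion` bookkeeping table, and the column and index names are all assumptions for the example. The driver reads the database's current version and applies each incremental script, in order, up to the target version.

```python
import sqlite3

# One incremental script per version step; real migrations would live in
# versioned files rolled out by the release pipeline.
MIGRATIONS = {
    122: "ALTER TABLE WorkItems ADD COLUMN AssignedTo TEXT",
    123: "CREATE INDEX IX_WorkItems_AssignedTo ON WorkItems (AssignedTo)",
}

def current_version(conn):
    """Ask the database which schema version it is currently at."""
    return conn.execute("SELECT Version FROM SchemaVersion").fetchone()[0]

def migrate(conn, target):
    """Apply each incremental migration from the current version up to target."""
    for version in range(current_version(conn) + 1, target + 1):
        with conn:  # each step commits atomically with its version bump
            conn.execute(MIGRATIONS[version])
            conn.execute("UPDATE SchemaVersion SET Version = ?", (version,))

# Demo: a database sitting at version 121 is rolled forward to 123.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SchemaVersion (Version INTEGER);
    INSERT INTO SchemaVersion VALUES (121);
    CREATE TABLE WorkItems (Id INTEGER PRIMARY KEY, Title TEXT);
""")
migrate(conn, 123)
```

Because each step only ever moves the version forward, this driver naturally supports the roll-forward-only policy: a database at any older version is migrated through every intermediate step rather than rolled back.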