Safe deployment practices

Sometimes a release doesn't live up to expectations. Despite following best practices and passing every quality gate, a production deployment occasionally causes unforeseen problems for users. To minimize and mitigate the impact of these issues, DevOps teams are encouraged to adopt a progressive exposure strategy that balances the exposure of a given release with its proven performance. As a release proves itself in production, it becomes available to broader audiences until everyone is using it. Teams can employ a set of safe deployment practices to maximize the quality and speed of releases in production.

Controlling exposure to customers

There are a variety of practices that DevOps teams can employ to control the exposure of updates to their customers. Historically, A/B testing has been a popular choice for teams looking to see how different versions of a service or user interface perform against target goals. It's also relatively easy to use since the changes are typically minor and often only compare different releases at the customer-facing edge of a service.
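To make the idea concrete, here's a minimal sketch of deterministic A/B assignment: a stable user ID is hashed so each user consistently sees the same variant. The function name, experiment name, and percentage split are illustrative assumptions for this example, not part of any particular platform.

```python
# Minimal sketch of deterministic A/B assignment. The experiment name and
# percentage split are illustrative, not a specific platform's API.
import hashlib

def assign_variant(user_id: str, experiment: str, percent_b: int = 50) -> str:
    """Hash the user and experiment together so assignment is stable per user."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "B" if bucket < percent_b else "A"

print(assign_variant("user-123", "new-checkout-flow"))  # "A" or "B", stable per user
```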

Safe deployment through rings

As platforms grow, the scale of the infrastructure and audience needs tend to grow as well. This creates a special kind of demand for a deployment model that balances the risks associated with a new deployment with the benefits of the updates it promises. The general idea is that a given release should be first exposed only to a small group of users with the highest tolerance for risk. Then, if the release is working as expected, it can be exposed to a broader group of users. If it's still on track, then the process can continue out through broader groups of users, or rings, until everyone is using it. With modern continuous delivery platforms like GitHub Actions and Azure Pipelines, building a deployment process with rings is accessible to DevOps teams of any size.
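As a rough illustration, a ring-based rollout can be thought of as an ordered sequence of audiences with a health gate between each one. The ring names and the health_check callable in this sketch are hypothetical placeholders for whatever signals and tooling a team actually uses.

```python
# Minimal sketch of progressive exposure through rings. Ring names and the
# health_check callable are hypothetical placeholders.
RINGS = ["ring 0 (internal)", "ring 1 (small audience)", "ring 2 (broad audience)", "ring 3 (everyone)"]

def roll_out(release: str, health_check) -> None:
    for ring in RINGS:
        print(f"deploying {release} to {ring}")
        if not health_check(ring):
            print(f"halting rollout of {release}: {ring} looks unhealthy")
            return
    print(f"{release} is fully rolled out")

# Example run with a health check that always passes.
roll_out("2.4.1", health_check=lambda ring: True)
```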

Feature flags

Sometimes there's a need for certain functionality to be deployed as part of a release, but not initially exposed to users. In those cases, feature flags provide a solution: the functionality ships in the release but is only enabled via configuration changes for a specific environment, ring, or deployment.
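A minimal sketch of that idea, assuming flag state lives in configuration keyed by deployment; the flag names and configuration layout are purely illustrative, not a specific flag service.

```python
# Minimal feature-flag sketch: the code ships everywhere, but behavior is gated
# by per-deployment configuration. Flag names and the config source are
# illustrative assumptions.
FLAGS = {
    "new-search": {"dev": True, "ring0": True, "ring1": False, "prod": False},
}

def is_enabled(flag: str, deployment: str) -> bool:
    return FLAGS.get(flag, {}).get(deployment, False)

if is_enabled("new-search", "ring0"):
    print("serving the new search experience")
else:
    print("serving the existing search experience")
```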

User opt-in

Similar to feature flags, user opt-in options provide a way to limit exposure. In this model, a given feature is enabled in the release, but not activated for a user unless they specifically want it. This allows the development team to offload the risk tolerance decision to users so they can decide how quickly they want to adopt certain updates.
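A sketch of the opt-in check, assuming per-user preferences are stored somewhere durable; the in-memory dictionary and names here are just stand-ins.

```python
# Minimal user opt-in sketch. The in-memory store is a stand-in for a real
# preferences service; user and feature names are illustrative.
OPT_INS = {"alice": {"preview-dashboard"}, "bob": set()}

def user_opted_in(user: str, feature: str) -> bool:
    return feature in OPT_INS.get(user, set())

print(user_opted_in("alice", "preview-dashboard"))  # True
print(user_opted_in("bob", "preview-dashboard"))    # False
```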

It's very common for multiple practices to be employed simultaneously. For example, a team may have an experimental feature intended for a very specific use case. Since it's risky, they'll deploy it to the first ring for internal users to try out. However, even though the feature is in the code, someone will need to set the feature flag for a specific deployment within the ring so that the feature is exposed via the user interface. Even then, the feature flag may only expose the option for a user to opt in to using the new feature. Anyone who isn't in the ring, isn't on that deployment, or hasn't opted in won't be exposed to the feature. While this is a fairly contrived example, it illustrates the flexibility and practicality of progressive exposure.
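Put together, the gates from that example combine as a simple conjunction; this sketch uses hypothetical boolean inputs just to show the effect.

```python
# Sketch of the combined gate described above: the feature is visible only when
# the deployment is in the first ring, the flag is set for that deployment, and
# the user has opted in. All inputs are hypothetical.
def feature_visible(ring: int, flag_enabled: bool, opted_in: bool) -> bool:
    return ring == 0 and flag_enabled and opted_in

print(feature_visible(ring=0, flag_enabled=True, opted_in=True))   # True
print(feature_visible(ring=1, flag_enabled=True, opted_in=True))   # False: wrong ring
print(feature_visible(ring=0, flag_enabled=True, opted_in=False))  # False: user hasn't opted in
```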

Common issues teams face early on

As teams move toward a more agile DevOps practice, they may find that they run into problems consistent with others who have migrated away from traditional monolithic deliveries. Teams used to deploying once every few months tend to build stabilization buffers into their plans: they expect each deployment to introduce a substantial shift in their service, along with unforeseen issues.

Payloads are too big

When a service is deployed every few months, it's usually filled with many changes. This not only increases the likelihood of immediate issues, but also makes those issues harder to troubleshoot because there's so much new code to sift through. By moving to more frequent deliveries, the differences between deployments become smaller, which allows for more focused testing and easier debugging.

No service isolation

Monolithic systems are traditionally scaled by upgrading the hardware on which they're deployed. However, when something goes wrong with that single instance, it causes problems for everyone. One simple solution is to add multiple instances across which the team can load balance users. However, this can require significant architectural work, since many legacy systems aren't built to run as multiple instances. It may also mean allocating significant duplicate resources for functionality that might be better consolidated elsewhere.

As new features are added, teams are encouraged to explore whether a microservices architecture can help them operate and scale thanks to better service isolation.

Manual steps lead to mistakes

When a team is only deploying a few times per year, it may not seem worth the investment to automate deliveries. As a result, many deployment processes are manually managed. This requires a significant amount of time and effort, and is prone to human error. Simply automating the most common build and deployment tasks can go a long way toward reducing lost time and unforced errors.
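Even a small script that chains the routine steps and fails fast can replace an error-prone manual checklist. The steps below are stand-ins for whatever a team's real toolchain commands would be.

```python
# Minimal sketch of scripting routine build-and-deploy steps instead of running
# them by hand. Each step is a stand-in for a real toolchain command.
def restore() -> None: print("restoring dependencies")
def build() -> None:   print("compiling the service")
def test() -> None:    print("running the test suite")
def publish() -> None: print("publishing the deployment artifact")

for step in (restore, build, test, publish):
    # A real pipeline would stop on the first failing step rather than continue.
    step()
```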

Teams can also make use of infrastructure as code to have better control over deployment environments. This removes the need for requests to the operations team to make manual changes as new features or dependencies are introduced to various deployment environments.

Only Ops can do deployments

Some organizations have policies that require all deployments to be initiated and managed by the operations staff. While this may have had good reasoning behind it in the past, an agile DevOps process greatly benefits from the ability for the development team to initiate and control deployments. Modern continuous delivery platforms offer granular control over who can initiate which deployments, as well as who can access status logs and other diagnostic information, making sure the right people have the right information as quickly as possible.

Bad deployments proceed and can't be rolled back

Sometimes deployments go wrong and teams need to address it. However, when processes are manual and access to information is slow and limited, it can be difficult to roll back to a previous working deployment. Fortunately, there are a variety of tools and practices for mitigating the risk of failed deployments.
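One common mitigation is simply to keep a record of the last known-good release so it can be redeployed immediately. Here's a minimal sketch, with a stubbed-out deploy step; the version numbers and functions are illustrative.

```python
# Minimal rollback sketch: track known-good releases so a bad deployment can be
# reverted quickly. deploy() is a stub for real deployment tooling.
history = ["1.8.0", "1.9.0"]  # known-good releases, newest last

def deploy(version: str) -> None:
    print(f"deploying {version}")

def roll_back() -> None:
    if history:
        deploy(history[-1])  # redeploy the last known-good version

deploy("2.0.0")  # suppose this release misbehaves in production...
roll_back()      # ...so the team reverts to the last known-good version
```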

Core principles

Teams looking to adopt safe deployment practices should set some core principles to underpin the effort.

Be consistent

The same tools used to deploy in production should be used in development and test environments. If there are issues, such as the ones that often arise from new versions of dependencies or tools, they should be caught well before the code is close to being released to production.

Care about quality signals

Too many teams fall into the common trap of not really caring about quality signals. Over time, they may find that they write tests or take on quality tasks simply to change a yellow warning to a green approval. Quality signals matter because they represent the pulse of a project. The quality signals used to approve deployments should be monitored constantly, every day.

Deployments should require zero downtime

While it's not critical for every service to always be available, teams should approach their DevOps delivery and operation stages with the mindset that they can and should deploy new versions without taking them down at all. Modern infrastructure and pipeline tools are advanced enough that it's feasible for virtually any team to target 100% uptime.
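Zero-downtime releases are often achieved with patterns such as blue-green deployment, where the new version comes up alongside the old one and traffic switches only once it's healthy. The slots, versions, and health check in this sketch are illustrative assumptions.

```python
# Minimal blue-green sketch: deploy the new version to the idle slot, then
# switch traffic only after it passes a health check. Slot names, versions,
# and the health check are illustrative.
slots = {"blue": "1.9.0", "green": None}
live = "blue"

def deploy_to_idle(version: str) -> str:
    idle = "green" if live == "blue" else "blue"
    slots[idle] = version
    return idle

def healthy(slot: str) -> bool:
    return slots[slot] is not None  # stand-in for a real health check

def switch_traffic(slot: str) -> None:
    global live
    live = slot
    print(f"traffic now served by {slot} ({slots[slot]})")

idle = deploy_to_idle("2.0.0")
if healthy(idle):
    switch_traffic(idle)  # the old slot stays available for instant rollback
```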

Deployments should happen during working hours

If a team works with the mindset that deployments require zero downtime, then it doesn't really matter when a deployment is pushed. Further, it becomes advantageous to push deployments during working hours, especially early in the day and early in the week. If something goes wrong, it can be caught early enough to limit the blast radius, and everyone will already be at work and focused on getting issues fixed.

Ring-based deployment

Teams with mature DevOps release practices are in a position to take on ring-based deployment. In this model, new features are rolled out to customers willing to take on the highest risk first. As the deployment is proven, its audience is expanded to more and more users until everyone is using it.

An example ring model

A typical ring deployment model is designed to find issues as early as possible through the careful segmentation of users and infrastructure. Below is an example of the rings used by a major team at Microsoft.

| Ring | Purpose | Users | Data center |
|------|---------|-------|-------------|
| 0 | Finds most of the user-impacting bugs introduced by the deployment | Internal only, with a high tolerance for risk and bugs | US West Central |
| 1 | Areas the team doesn't test extensively | Customers using a breadth of the product | A small data center |
| 2 | Scale-related issues | Public accounts, ideally free ones using a diverse set of features | A medium or large data center |
| 3 | Scale issues in internal accounts and international-related issues | Large internal accounts and European customers | An internal data center and a European data center |
| 4 | Remaining scale units | Everyone else | All deployment targets |

Allowing bake time

The term bake time refers to the amount of time a deployment is expected to run in a ring before it can be promoted to the next ring. The idea is that some issues may take hours or longer to start showing symptoms, so the release should be in use for an appropriate amount of time before it's considered ready.

In general, 24 hours should be enough time for most scenarios to expose latent bugs. However, this period should include a window of peak usage, which is why a full business day is required for services that peak during business hours.
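Here's a sketch of how a bake-time gate might be expressed, using an assumed 24-hour minimum per ring; the values and dates are illustrative, and a real gate would also confirm the window covered peak usage.

```python
# Sketch of a bake-time promotion gate. The 24-hour minimums are illustrative;
# a real gate would also confirm the window covered peak usage.
from datetime import datetime, timedelta

BAKE_HOURS = {0: 24, 1: 24, 2: 24, 3: 24}  # hours required before leaving each ring

def can_promote(ring: int, deployed_at: datetime, now: datetime) -> bool:
    required = timedelta(hours=BAKE_HOURS.get(ring, 24))
    return now - deployed_at >= required

deployed = datetime(2024, 1, 8, 9, 0)
print(can_promote(0, deployed, datetime(2024, 1, 9, 10, 0)))  # True: baked for 25 hours
print(can_promote(0, deployed, datetime(2024, 1, 8, 17, 0)))  # False: only 8 hours
```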

Expediting hotfixes

When a bug has a serious impact in production, it's known as a live site incident (LSI). LSIs necessitate the creation of a hotfix, which is an out-of-band update designed to address a high-priority issue.

When a bug is Sev 0, the most impactful type of bug, the hotfix may be deployed directly to the impacted scale unit as quickly as responsibly possible. While it's obviously critical that the fix not make things worse, bugs of this severity are considered so disruptive that they must be addressed immediately.

Bugs rated Sev 1 must be deployed through ring 0, but can then be deployed out to the affected scale units as soon as approved.

Hotfixes for bugs with lower severity must be deployed through all rings as planned.
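Expressed as simple routing logic, the severity rules above might look like this sketch; the ring names and severity scale mirror the example, and the function itself is hypothetical.

```python
# Sketch of the hotfix routing described above: Sev 0 goes straight to the
# affected scale unit, Sev 1 passes through ring 0 first, and lower severities
# follow the normal ring sequence. The function is hypothetical.
def hotfix_path(severity: int) -> list[str]:
    if severity == 0:
        return ["affected scale unit"]
    if severity == 1:
        return ["ring 0", "affected scale units"]
    return ["ring 0", "ring 1", "ring 2", "ring 3", "ring 4"]

print(hotfix_path(0))  # ['affected scale unit']
print(hotfix_path(1))  # ['ring 0', 'affected scale units']
print(hotfix_path(2))  # full ring sequence
```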

Key takeaways

Every team wants to deliver updates quickly and at the highest possible quality. With the right practices, it's possible to make delivery a productive and painless part of the DevOps cycle.

  • Deploy often
  • Stay green throughout the sprint
  • Use consistent deployment tooling in development, test, and production
  • Use a continuous delivery platform that allows automation and authorization
  • Follow safe deployment practices