How Microsoft delivers quality services with DevOps

Microsoft has decades of experience delivering highly scalable services to production environments. Just as those environments have grown far beyond anything that hosted our services thirty years ago, so too have the practices we use to deliver to them.

As our DevOps processes have matured, we've identified a set of core principles that apply to virtually any modern software effort. We've used this experience to help thousands of major customers adopt more efficient delivery practices so that they can benefit from the same approaches we've been refining for years.

These principles cover three major initiatives that any company can adopt to improve their DevOps delivery:

  1. Change the organizational mindset and cadence to focus on delivery.
  2. Engineer systems to be owned, tested, and delivered by accountable teams.
  3. Shift right to test in production.

Change the organizational mindset

Organizations always want to ship faster. It's the most obvious benefit that any team will be able to easily measure and appreciate. However, there is often resistance to this effort from those who are legitimately concerned about product stability. The typical DevOps cadence involves short cycles with regular deployments to production. To compromise, teams may try to adopt a sprint cycle that includes a stabilization period. Unfortunately, this incentivizes the wrong behavior.

Change the cadence

So what should you do first? Change the cadence.

Some teams will naturally do the right thing and burn down their debt. Other teams will keep building their debt because their engineers want to ship as many features as possible during the sprint. As a result, teams who managed their debt will be called on to support the greedier teams in paying down their debt during stabilization. These costs play themselves out through the pipelines and into production.

Paradoxically, removing the stabilization period quickly has the effect of improving the way teams manage their debt. Instead of pushing off key maintenance work to the stabilization period, those teams quickly learn that they'll be stuck with spending the next sprint catching up to the debt targets set by the organization. After one cycle they'll understand that features get delivered when they're proven and worth the cost of deployment.

Team autonomy and organizational alignment

Another thing organizations need to focus on is letting go. Management appreciates the security of having plans laid out up front, but committing to specific dates that far ahead offers only false security. Teams need to be able to run with their own backlogs and their own plans, and then find ways to align them with the rest of the organization.

Executives often ask us about how we run our feature chats and sprint demos. They're looking for specific and measurable targets, but it's not always that easy. There really aren't KPIs that measure team productivity or performance, nor can you use them to project whether a feature is on track. You need to have discussions with the teams, where they tell and show you where things are. The tools facilitate that, but conversation is the most transparent way to communicate.

Engineer systems for better ownership and accountability

Much of the improvement teams can gain immediately comes from automating the pipelines their code uses to get from repository to production. We'll assume that you already understand the benefits of creating release pipelines with continuous integration and automated testing. But there's a lot more that can be done to improve the efficiency of teams working toward an optimized DevOps process.

An important goal for teams new to DevOps delivery is to always be delivering features. Building schedules is useful insofar as it gives teams and individuals an exercise in assessing what can reasonably be completed over a given period of time. But when it comes down to actually delivering features, expect that some will come earlier and some will come later. As long as there's a focus on always delivering, the work can be prioritized and the most important features will make it to production.

The organizational benefits of microservices

Microservices offer a variety of technical benefits that generally improve and simplify delivery. They can also provide natural boundaries for team ownership. When a team has true autonomy over the investment in a given service, it can prioritize how features are implemented and how debt is managed. It can also plan concerns like versioning independently of the services that depend on what it provides.
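
As a rough illustration of that independence, here's a minimal sketch of a hypothetical "orders" microservice (using Flask; the service and field names are made up) that exposes both an old and a new version of its contract, so dependent services can migrate on their own schedule:

```python
# Minimal sketch of a hypothetical, independently versioned microservice (assumes Flask).
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v1/orders/<order_id>")
def get_order_v1(order_id):
    # Original contract: the flat shape existing consumers still rely on.
    return jsonify({"id": order_id, "status": "shipped"})

@app.route("/v2/orders/<order_id>")
def get_order_v2(order_id):
    # New contract: richer shape, rolled out without breaking v1 callers.
    return jsonify({"id": order_id,
                    "fulfillment": {"status": "shipped", "carrier": "contoso"}})

if __name__ == "__main__":
    app.run(port=5000)
```

The owning team decides when to add /v2 and when to retire /v1, rather than coordinating a lockstep release with every consumer.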

Work in main

We used to have engineers working in separate branches. Merge debt built up invisibly in each branch until the developer tried to integrate. Unsurprisingly, the more teams you have, the bigger that integration becomes. How do you get that integration to happen in smaller chunks, faster and more continuously? The key is to work in main. One of the reasons we moved to Git is its lightweight branching. The really big benefit to our internal engineering was getting rid of the deep branch hierarchy and the waste it introduced. All the time that used to be spent integrating now gets poured into delivery.

Walk the walk

Because we use the tools that we build, a single investment yields benefits both in our productivity and in our products. For example, it's really important that the release management system we ship to everybody else is the one we use ourselves, instead of something secondary that siphons velocity away from the team.

Continuous deployment

The less frequently you deploy, the harder it is to deploy. The more time there is between deployments, the more piles up. Before long, the code isn't fresh in anyone's mind, and you accumulate deployment debt. The more you work in small chunks, the easier the actual deployment becomes. Teams used to avoid deploying because it was so hard, but that just made it harder. It seems obvious in hindsight, but at the time it was counterintuitive. Deploying more frequently pushed us to make the tools and pipelines we use to deploy more efficient and reliable.

Shift right to test in production

We recommend a general shift right to test in production. This helps teams ensure both that their ever-changing production environments are ready to handle deployments and that the tests they run prior to production remain valid.

Use resiliency patterns

A major risk for any complex deployment is cascading failure: one component fails, which causes components that depend on it to fail, and so on, until the entire system breaks down. Understand where your single points of failure (SPOFs) are and how they're mitigated. Also be sure to test those mitigation processes, especially in production.
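
One common mitigation is a circuit breaker in front of a fragile dependency. Here's a minimal sketch in Python (the class, thresholds, and timeouts are illustrative, not a specific library): after repeated failures the breaker "opens" and fails fast instead of hammering the dependency, then allows a trial call after a cool-down.

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: fails fast while a dependency is unhealthy."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # time the breaker opened, or None if closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: don't call the struggling dependency at all.
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: allow a trial call (half-open).
            self.opened_at = None
            self.failure_count = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failure_count = 0  # success closes the breaker again
        return result
```

Failing fast at one boundary keeps a single unhealthy component from dragging its dependents down with it.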

Feature flags enable progressive experimentation

There are going to be times when a team can't completely finish a feature in time for a sprint deployment. However, there is often benefit in deploying the current version for testing in production. The key here is to control exposure through the use of feature flags. This allows teams to merge and deploy their code without risking significant problems with the overall user base. Instead, they can turn the feature on for specific segments, such as the development team or a small group of early adopters, in order to determine if and how to complete it.
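
A minimal sketch of that exposure control, assuming a hypothetical flag store and segment names (a real system would typically pull flags from a service rather than an in-memory dictionary):

```python
# Hypothetical flag store: the incomplete feature ships dark, enabled only for
# the development team and a small ring of early adopters.
FLAGS = {
    "new-checkout": {"enabled_for": {"dev-team", "early-adopters"}},
}

def is_enabled(flag_name: str, user_segment: str) -> bool:
    flag = FLAGS.get(flag_name)
    return flag is not None and user_segment in flag["enabled_for"]

def render_checkout(user_segment: str) -> str:
    if is_enabled("new-checkout", user_segment):
        return "new checkout experience"   # merged and deployed, but only these segments see it
    return "existing checkout experience"  # everyone else stays on the stable path

print(render_checkout("dev-team"))        # new checkout experience
print(render_checkout("general-public"))  # existing checkout experience
```

Widening the segment list over time turns the same deployment into a progressive rollout, and removing the segment is the rollback.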

Instrument everything

Regardless of where an app is deployed, it's really important to instrument everything. This instrumentation not only helps identify and fix issues with the current version, but it also provides invaluable insight into what's being used and what we should add next.
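
As a minimal sketch of what "instrument everything" can look like in code, here's a structured-event helper using only the Python standard library; the event names and fields are hypothetical, and a real service would typically send these to a telemetry pipeline rather than a log stream:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
telemetry = logging.getLogger("telemetry")

def track_event(name: str, **properties):
    # Emit one structured event per interesting action so usage and failures
    # can be counted and analyzed later.
    telemetry.info(json.dumps({"event": name, "timestamp": time.time(), **properties}))

def export_report(report_id: str):
    start = time.perf_counter()
    try:
        # ... the actual feature work would go here ...
        track_event("report_exported", report_id=report_id,
                    duration_ms=round((time.perf_counter() - start) * 1000, 1))
    except Exception as exc:
        track_event("report_export_failed", report_id=report_id,
                    error=type(exc).__name__)
        raise
```

The success and failure events answer both questions at once: is the feature healthy, and is anyone actually using it?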

Getting metrics right

Designing metrics can be as hard as designing features. A common mistake is to include too many metrics, because it's easy and it feels like nothing gets missed. However, you'll end up ignoring, and not trusting, metrics you never had a specific need for. Instead, have the team take the time to think through the data points they need to measure success. You can always add or change metrics later, but having a defensible set up front makes that process easier.

Beyond the raw basis of a given metric, such as the total number of users, think about what the metric is really meant to tell you. Often a better metric is the velocity or acceleration of user gains. The right metrics vary from project to project, but favor those with the potential to drive changes to the business over vanity metrics.
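
A small worked example of that point, with made-up numbers: instead of reporting only the running total of users, compute the week-over-week gain (velocity) and the change in that gain (acceleration), which say more about momentum.

```python
# Hypothetical weekly active user counts.
weekly_active_users = [10_000, 10_400, 11_100, 12_100, 13_000]

# Velocity: users gained each week; acceleration: change in that gain.
velocity = [b - a for a, b in zip(weekly_active_users, weekly_active_users[1:])]
acceleration = [b - a for a, b in zip(velocity, velocity[1:])]

print(velocity)      # [400, 700, 1000, 900]
print(acceleration)  # [300, 300, -100]  <- growth is still healthy, but it's slowing
```

A flat or negative acceleration is the kind of signal that can drive a business decision, where the ever-growing total would have looked fine on its own.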

You're not done until telemetry confirms you're done

We bake metrics into our reviews up to the highest levels of leadership. Every six weeks we present how we're doing on health, our business, our scenarios, and our customer telemetry. We discuss it all with the executives and then bring it down to the teams. We look at those same engaged-user metrics and ask, "What does that mean for your feature?" People all through the org can say, "I don't just ship the feature; now I go and look to see whether people are using it, or whether I need to adjust the backlog and work on the feature more to make it achieve its goals."

Summary

It's never a straight line to get from A to B, nor is B the end. There will always be setbacks and mistakes. But those should be viewed as learning opportunities that may change the tactics for how a given part of the process is completed. Over time, every team evolves its DevOps practices as it builds on experience and adjusts to changing needs. The key thing is to focus on delivering value every day, whether it's to end users or to the process itself.