How Microsoft develops modern software with DevOps

Microsoft actively pursues a strategy of using One Engineering System throughout the company. This initiative envisions a modern system where we build and deliver all of our products using a solid DevOps process centered on a Git-centric release flow.

One of the questions that we're often asked is how we use version control and branching to deliver changes safely to production. Not only do the requirements of different organizations within Microsoft vary greatly, but requirements of different teams within a given organization scale with size and complexity. Adopting a standardized development process is an ambitious undertaking.

To address these varied needs, Microsoft uses a trunk-based branching strategy to help us develop products quickly and deploy them regularly. In this article, we'll focus on the practical implementation and lessons learned from applying this process across various teams within Microsoft. We'll cover how this strategy scales to our development needs, from small services to massive platforms.

Our release flow

Our release flow encompasses the entire DevOps process from development to release. Every organization should settle on a standard process like this in order to ensure consistency across teams. We'll summarize this flow below and then dig into the implementation details and lessons further down.

Branch

When a developer wants to fix a bug or implement a feature, they create a new branch off of our main integration branch. Thanks to Git's lightweight branching model, we create these short-lived topic branches any and every time we want to write some code. Developers are encouraged to commit early and to avoid long-running feature branches by using feature flags.

Push

When the developer is ready to integrate their changes and share them with the rest of the team, they push their local branch to a branch on the server and open a pull request. Since we have several hundred developers working in our repository, each with many branches, we use a naming convention for branches on the server to help alleviate confusion and what we call branch proliferation. Generally, developers create a local branch named users/<username>/feature, where <username> is replaced with their account name.
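As a rough illustration of the naming convention above, the following sketch builds and validates a topic branch name. The function name, the lowercasing of the account name, and the allowed character set are assumptions made for the example, not rules stated in this article.

```python
import re

# Assumed pattern for the users/<username>/feature convention. The allowed
# characters are an illustrative guess, not a documented rule.
_TOPIC_BRANCH = re.compile(r"^users/[a-z0-9._-]+/[A-Za-z0-9._-]+$")


def topic_branch_name(username: str, feature: str) -> str:
    """Build a branch name of the form users/<username>/<feature>."""
    name = f"users/{username.lower()}/{feature}"
    if not _TOPIC_BRANCH.match(name):
        raise ValueError(f"not a valid topic branch name: {name!r}")
    return name
```

A check like this could run in a client-side hook or a small helper script so that malformed names are caught before the push.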

Pull request

We use pull requests to control how topic branches are merged into main. Pull requests ensure that branch policies are satisfied. For example, we build the proposed changes and run a quick test pass. Our first- and second-level test suites include around 60,000 tests that run in just under five minutes. This isn't our complete test matrix, but it's enough to quickly give us confidence in the pull request.

Next, we require that other members of the team review the code and approve the changes. Code review picks up where the automated tests left off, and is particularly good at spotting architectural problems. Manual code reviews ensure that more engineers on the team have visibility into the changes and that code quality remains high.

Merge

Once all the build policies are satisfied and reviewers have signed off, the pull request is completed. This means that the topic branch is merged into the main integration branch, main.

After merge, we run additional acceptance tests that take more time to complete. These are more like traditional post-checkin tests and we use them to perform an even more thorough validation. This gives us a good balance between having fast tests during the pull request review while still having complete test coverage before release.

Important lessons learned over time

Now that you have a firm understanding of the general workflow of a developer contributing code into the repo, let's discuss some of the key implementation details.

Our release flow vs. GitHub Flow

A very popular trunk-based development release flow used by many organizations is GitHub Flow. It provides a great starting point for organizations looking to implement a reasonably scalable approach to Git.

However, some organizations may find that as their needs grow, it's necessary to diverge from parts of the GitHub Flow. For example, an often overlooked part of GitHub Flow is that pull requests are actually delivered directly to production to test them before they're merged into main. This means that developers need to wait in the deployment queue to test their changes before they can merge their pull requests.

Some teams have several hundred developers working constantly in a single repository, and can complete over 200 pull requests into main per day. If each of those pull requests required a deployment to multiple Azure data centers across the globe, our developers would lose time waiting in the queue to deploy their branches instead of writing software.

Instead, we continue developing in our main branch and batch up deployments into three-week blocks, aligned with our sprint cadence.

Git repository strategy

Different teams opt for different strategies when it comes to managing their Git repositories. For some teams, the majority of their code is in one Git repository. Code is broken up into components, which each live in their own root-level folder. Really large components, especially older components, may be made up of multiple subcomponents. Those subcomponents get separate sub-folders within the parent component.

Git repository structure

Some teams also manage a few adjunct repositories. For instance, the build & release agent and tasks, the VS Code extension, and more are developed in the open on GitHub. Configuration changes are checked into a separate repository. A handful of other packages that the team depends on come from other places and are consumed via NuGet.

Mono repo or multi-repo with Git

While some teams elect to have a single monolithic repository (the mono-repo), other products at Microsoft use a multi-repo approach. Skype, for instance, has hundreds of small repositories that get stitched together in various combinations to create their many different clients, services, and tools. Especially for teams embracing microservices, multi-repo can be the right approach. Usually, older products that began as monoliths find a mono-repo approach to be the easiest transition to Git, and their code organization reflects that.

Git branch structure and policies

Our release flow lets us keep main buildable at all times and work from short-lived topic branches. When we're ready to ship, whether that's a sprint or a major update, we start a new release branch off main. Release branches never merge back to main, so we require cherry-picking important changes. In the diagram below, short-lived branches are shown in light blue and the release branches are shown in dark blue. One branch with a commit that needs cherry-picking is shown in red.

Git branch structure and policies

We use a couple of Git branching features to help enforce this structure and keep main clean. Branch policies prevent direct pushes to main. We require a successful build (including passing tests), signoff by the owners of any code that was touched, and a handful of external checks verifying corporate policies before a pull request can be completed.
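The policy gate described above can be sketched as a simple predicate over a pull request's state. This is a minimal model, not the actual policy engine: the PullRequest fields and the rule that every owning group must approve are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Hypothetical snapshot of the state a policy check would inspect."""
    build_succeeded: bool                      # build, including tests
    touched_owners: set[str]                   # owners of any code touched
    approvals: set[str] = field(default_factory=set)
    external_checks: dict[str, bool] = field(default_factory=dict)


def blocking_reasons(pr: PullRequest) -> list[str]:
    """Return the reasons the pull request cannot be completed yet."""
    reasons = []
    if not pr.build_succeeded:
        reasons.append("build or tests failed")
    # Assumption: every owner of touched code must sign off.
    missing = sorted(pr.touched_owners - pr.approvals)
    if missing:
        reasons.append("missing signoff: " + ", ".join(missing))
    failed = sorted(name for name, ok in pr.external_checks.items() if not ok)
    if failed:
        reasons.append("failed checks: " + ", ".join(failed))
    return reasons
```

An empty list means every policy is satisfied and the pull request can be completed. Returning reasons rather than a bare boolean makes it easier to tell the author what is still outstanding.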

Sign off policy

We also like to keep our branch hierarchy tidy. We use permissions to block creation of branches at the root level of the hierarchy. Everyone can create branches in folders like users/, features/, and teams/. Only release managers have permission to create branches under releases/, and some automation tools have permission to the integrations/ folder.
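The folder permissions above can be expressed as a small lookup table. The sketch below is only an approximation: the role names (release_manager, automation) and the function are hypothetical, and a real system would enforce this through server-side permissions rather than application code.

```python
from __future__ import annotations

# None means "everyone"; otherwise only the listed roles may create branches.
# Role names are hypothetical placeholders for this example.
FOLDER_ROLES: dict[str, set[str] | None] = {
    "users/": None,
    "features/": None,
    "teams/": None,
    "releases/": {"release_manager"},
    "integrations/": {"automation"},
}


def can_create_branch(branch: str, role: str) -> bool:
    """Check whether a role may create the given branch name."""
    for folder, roles in FOLDER_ROLES.items():
        # The branch must live inside the folder, not be the folder itself.
        if branch.startswith(folder) and len(branch) > len(folder):
            return roles is None or role in roles
    # Root-level branches and unknown folders are blocked.
    return False
```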

Branches

Working in the Git repository

Within this structure, how do engineers actually get their daily work done? Obviously the environment's going to vary heavily by team and by individual. Some people like the command line, others like Visual Studio, and others work on different platforms. But the structures and policies in place on our repository ensure a solid and consistent foundation. Let's walk through a handful of common tasks.

Git workflow to build a new feature

Our first stop has to be building a new feature. That's the core of a software engineer's job, right? We'll skip past the non-Git parts like looking at telemetry data, coming up with a design and a spec, and even writing the actual code. Let's jump right in to working with the repository.

The engineer starts by syncing to the latest commit on main. We keep main always buildable, so this is virtually guaranteed to be a good starting point. The developer checks out a new feature branch, makes code changes, commits, and pushes to the server. When our engineer starts a pull request, several interesting things happen.

Using Git branch policy

Upon the creation of a pull request, automated systems check that the new code builds, hasn't broken anything, and hasn't violated any policies related to security, compliance, and so on. This doesn't block other work from happening in parallel. Most teams have configured integration with Microsoft Teams, which announces the new pull request to the engineer's colleagues. The owners of any code touched are automatically added as reviewers. We make liberal use of optional reviewers for code that many people touch, like REST client generation and shared controls, as a way to get expert eyes on those changes.

Teams integration

Once the people and the automation are satisfied, our engineer completes the pull request. If there's a merge conflict, the engineer is given instructions on how to sync to the conflict, fix it, and re-push the changes. The automation all runs again on the fixed code, but humans don't have to sign off again.

The branch is merged into main, and the new code will deploy in the next sprint or major release. Importantly, that doesn't mean the new feature will show up right away. We decouple the deployment and exposure of new features using feature flags. This means even if the feature needs a little more time before it's ready to show off, it's safe to go to main if the product builds and deploys. Once in main, the code ends up in an official build, where it's again tested, confirmed to meet policy, and digitally signed.
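To make the separation of deployment and exposure concrete, here is a minimal feature flag sketch. The flag name and the dashboard example are invented for illustration, and a production flag service would typically evaluate flags per user or per account rather than from a plain dictionary.

```python
from __future__ import annotations

# Hypothetical flag name used only for this example.
FLAG_NEW_DASHBOARD = "new_dashboard"


def render_dashboard(flags: dict[str, bool]) -> str:
    """Pick a code path based on a flag; missing flags default to off."""
    if flags.get(FLAG_NEW_DASHBOARD, False):
        return "new dashboard"
    return "classic dashboard"
```

Both code paths ship in the same build. With the flag off, users keep seeing the classic behavior even though the new code is already deployed.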

Git facilitates our shift left

Working this way with Git gives us a number of benefits. First, we work out of a single main, virtually eliminating merge debt. Second, the pull request flow gives us a common point to force testing, code review, and error detection early in the pipeline. This helps us shorten the feedback cycle to developers since errors are usually detected in minutes, not hours or days. Also, it gives us confidence when we refactor since all changes are tested all the time.

Currently, a product with 200+ pull requests may produce 300+ continuous integration builds per day. Together, that amounts to 500+ test runs every 24 hours, a level that would have been unthinkable without this workflow.

Releases at sprint milestones

At the end of each sprint, we create a deployment branch from the main branch. For example, at the end of sprint 129, we create a new branch releases/M129. We then put the sprint 129 branch into production.

Once we've branched to our deployment branch, the main branch remains open for developers to merge changes. These changes, of course, do not get deployed to production. They'll be deployed within the next three weeks as part of the next sprint deployment.

Illustration of release branch at sprint 129

Releasing hotfixes

Sometimes changes need to go to production more quickly. We generally won't add big new features in the middle of a sprint, but occasionally we want to bring a bug fix in quickly to unblock users. Sometimes these fixes are minor, such as typos. However, sometimes we have a bug that causes an availability issue, which we call a live site incident.

When this happens, we start with our normal workflow. We create a branch from main, get it code reviewed, and complete the pull request to merge it. We always start by making the change in main first. This allows us to create the fix quickly and validate it locally without having to switch to the release branch.

More importantly, by following this process, we're guaranteed that our change goes into main. This is critical for us. If we were to fix a bug in the release branch first, and accidentally forget to bring the change back to main, we would have a recurrence of the bug during the next deployment when we create our sprint 130 release branch from main.

It's particularly easy to forget to do this during the confusion and stress that can arise during an outage. So by always bringing our changes to main first, we know that we'll always have our changes in both the main branch and our release branch.

Git functionality enables this workflow. From the Pull Request page, you can cherry-pick a pull request onto a different branch. To bring changes immediately into production, once we have merged the pull request into main, we cherry-pick the change into the release branch. This creates a new pull request that targets the release branch, backporting the contents that were just merged into main.
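The main-first rule can be illustrated with a toy in-memory model. This is not how Git stores history: branches here are plain lists of change labels, and cherry_pick simply copies a label, whereas a real cherry-pick creates a new commit with the same diff. The class and the change names are hypothetical.

```python
from __future__ import annotations


class ToyRepo:
    """Toy model: each branch is an ordered list of change labels."""

    def __init__(self) -> None:
        self.branches: dict[str, list[str]] = {"main": []}

    def commit(self, branch: str, change: str) -> None:
        self.branches[branch].append(change)

    def create_branch(self, name: str, source: str = "main") -> None:
        # A new branch starts with a copy of the source branch's changes.
        self.branches[name] = list(self.branches[source])

    def cherry_pick(self, change: str, source: str, target: str) -> None:
        if change not in self.branches[source]:
            raise ValueError(f"{change!r} is not on {source!r}")
        if change not in self.branches[target]:
            self.branches[target].append(change)


repo = ToyRepo()
repo.commit("main", "feature-a")
repo.create_branch("releases/M129")
repo.commit("main", "hotfix-1")                          # fix lands in main first
repo.cherry_pick("hotfix-1", "main", "releases/M129")    # then is backported
repo.create_branch("releases/M130")                      # next sprint's branch
```

Because the fix was made in main before it was backported, releases/M130 contains it automatically. Had the fix been committed only to releases/M129, the new branch would have been cut without it.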

Illustration of cherry-picking a hotfix commit into branch 129

By opening a new pull request, we get traceability and reliability from branch policies. Using cherry-pick functionality allows us to do it quickly. We don't need to download the release branch to a local computer to cherry-pick the changes. It's all handled efficiently on the server. And if we need to fix merge conflicts or make minor changes due to differences between the two branches, we can do that on the server, too. We can edit changes directly in the browser-based text editor or use the Pull Request Merge Conflict Extension for a more advanced experience.

Once we have a pull request targeting our release branch, we'll code review it again, evaluate the branch policies, and test it. Once it's merged, it will get deployed to our first ring of servers in minutes. From there, we'll progressively deploy it to more accounts using deployment rings. As more users are exposed to the changes, we'll monitor success and ensure that our change has fixed the bug while not introducing any new deficiencies or slowdowns as the fix is deployed to the rest of our data centers.
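A progressive rollout of this kind can be sketched as a loop that deploys ring by ring and stops when monitoring reports a problem. The ring names, the is_healthy callback, and the stop-on-first-failure behavior are assumptions for the example; the article does not describe how the real health signal is computed.

```python
from __future__ import annotations

from typing import Callable, Sequence


def deploy_through_rings(
    rings: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy to each ring in order; stop after the first unhealthy ring.

    Returns the rings that received the deployment, including the one
    where a problem was detected.
    """
    deployed = []
    for ring in rings:
        deployed.append(ring)        # stand-in for the actual deployment
        if not is_healthy(ring):     # stand-in for monitoring the fix
            break
    return deployed
```

Deploying to the smallest ring first limits how many users are exposed if the fix itself turns out to be faulty.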

Moving on to the next sprint

After three weeks, we'll finish adding features to sprint 130, and we'll be ready to deploy those changes. To deploy, we'll create the new release branch, releases/M130, from main and deploy that.

At this point, we'll actually have two branches in production. Since we use a ring-based deployment to bring changes to production safely, our fast ring will get the sprint 130 changes, while our slow ring servers will stay on sprint 129 while the new changes are validated in production.

This raises an interesting problem. If we need to hotfix a change in the middle of a deployment, we may need to hotfix two different releases: the sprint 129 release and the sprint 130 release. In these cases, we'll port the hotfix to both release branches and deploy both release branches. The 130 branch will redeploy with the hotfix to the rings that have already been upgraded. The 129 branch will redeploy with the hotfix to the outer rings that haven't been upgraded to next sprint's version yet.
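One way to picture the two-release case is to group rings by the sprint they currently run. Each resulting release branch needs the hotfix and a redeploy to its rings. The ring names and the function below are hypothetical; only the releases/M<sprint> naming comes from this article.

```python
from __future__ import annotations


def hotfix_plan(ring_sprints: dict[str, int]) -> dict[str, list[str]]:
    """Map each release branch still in production to the rings it serves."""
    plan: dict[str, list[str]] = {}
    for ring, sprint in ring_sprints.items():
        plan.setdefault(f"releases/M{sprint}", []).append(ring)
    return plan


# Mid-deployment: the fast rings are on sprint 130, the slow ring on 129.
plan = hotfix_plan({"ring0": 130, "ring1": 130, "ring2": 129})
```

Here the plan contains both releases/M130 and releases/M129, so the hotfix is ported to both. Once every ring reports sprint 130, the plan contains a single branch and the sprint 129 branch is no longer needed.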

Once all the rings have been deployed, our old branch from sprint 129 is completely abandoned. We'll never need it again, since we were very careful to ensure that any changes that we brought into the sprint 129 branch as hotfixes were also made in main. So those changes will also be in the releases/M130 branch that we create.

Illustration of release branch at sprint 130

Summary

Our release flow model is at the heart of how Microsoft develops with DevOps. It allows us to use a simple, trunk-based branching strategy for our online service. But instead of keeping our developers stuck in a deployment queue, waiting to be able to merge their changes, our developers can keep working.

It also enables us to deploy new features across all our Azure data centers at a regular cadence, and despite the size of our codebases and the number of developers working in them, we can bring hotfixes into production quickly and efficiently.