How Microsoft develops modern software with DevOps
Microsoft actively pursues a strategy of using One Engineering System throughout the company. This initiative envisions a modern system where we build and deliver all of our products using a solid DevOps process centered on a Git-centric release flow.
One of the questions that we're often asked is how we use version control and branching to deliver changes safely to production. Not only do the requirements of different organizations within Microsoft vary greatly, but requirements of different teams within a given organization scale with size and complexity. Adopting a standardized development process is an ambitious undertaking.
To address these varied needs, Microsoft uses a trunk-based branching strategy to help us develop products quickly and deploy them regularly. In this article, we'll focus on the practical implementation and lessons learned from applying this process across various teams within Microsoft. We'll cover how this strategy scales to our development needs, from small services to massive platforms.
Our release flow
Our release flow encompasses the entire DevOps process from development to release. Every organization should settle on a standard process like this in order to ensure consistency across teams. We'll summarize this flow below and then dig into the implementation details and lessons further down.
Branch
When a developer wants to fix a bug or implement a feature, they create a new branch off of our main integration branch. Thanks to Git's lightweight branching model, we create these short-lived topic branches any and every time we want to write some code. Developers are encouraged to commit early and to avoid long-running feature branches by using feature flags.
Push
When the developer is ready to integrate their changes and share them with the rest of the team, they push their local branch to a branch on the server and open a pull request. Since we have several hundred developers working in our repository, each with many branches, we use a naming convention for branches on the server to help alleviate confusion and what we call branch proliferation. Generally, developers create a local branch named users/<username>/feature, where <username> is replaced with their account name.
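The naming convention above can be sketched as a small helper. This is only an illustration: the function names, the lowercase normalization, and the allowed character set are assumptions, not rules stated in this article.

```python
import re

# Hypothetical pattern for the convention users/<username>/<feature>.
# The allowed character set is an assumption made for this sketch.
TOPIC_BRANCH = re.compile(r"^users/(?P<username>[a-z0-9._-]+)/(?P<feature>[a-z0-9._-]+)$")


def topic_branch_name(username: str, feature: str) -> str:
    """Build a branch name of the form users/<username>/<feature>."""
    name = f"users/{username.lower()}/{feature.lower().replace(' ', '-')}"
    if not TOPIC_BRANCH.match(name):
        raise ValueError(f"not a valid topic branch name: {name!r}")
    return name


def is_topic_branch(name: str) -> bool:
    """Return True if the name follows the users/<username>/<feature> convention."""
    return TOPIC_BRANCH.match(name) is not None
```

A check like is_topic_branch could run in a local hook or a server-side validation step to flag branches that fall outside the convention.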
Pull request
We use pull requests to control how topic branches are merged into main. Pull requests ensure that branch policies are satisfied. For example, we build the proposed changes and run a quick test pass. Our first- and second-level test suites include around 60,000 tests that run in just under five minutes. This isn't our complete test matrix, but it's enough to quickly give us confidence in the pull request.
Next, we require that other members of the team review the code and approve the changes. Code review picks up where the automated tests left off, and is particularly good at spotting architectural problems. Manual code reviews ensure that more engineers on the team have visibility into the changes and that code quality remains high.
Merge
Once all the build policies are satisfied and reviewers have signed off, the pull request is completed. This means that the topic branch is merged into the main integration branch, main.
After merge, we run additional acceptance tests that take more time to complete. These are more like traditional post-checkin tests and we use them to perform an even more thorough validation. This gives us a good balance between having fast tests during the pull request review while still having complete test coverage before release.
Important lessons learned over time
Now that you have a firm understanding of the general workflow of a developer contributing code into the repo, let's discuss some of the key implementation details.
Our release flow vs. GitHub Flow
A very popular trunk-based development release flow used by many organizations is GitHub Flow. It provides a great starting point for organizations looking to implement a reasonably scalable approach to Git.
However, some organizations may find that as their needs grow, it's necessary to diverge from parts of the GitHub Flow. For example, an often overlooked part of GitHub Flow is that pull requests are actually delivered directly to production to test them before they're merged into main. This means that developers need to wait in the deployment queue to test their changes before they can merge their pull requests.
Some teams have several hundred developers working constantly in a single repository, and can complete over 200 pull requests into main per day. If each of those pull requests required a deployment to multiple Azure data centers across the globe, our developers would lose time waiting in the queue to deploy their branches instead of writing software.
Instead, we continue developing in our main branch and batch up deployments into three-week blocks, aligned with our sprint cadence.
Git repository strategy
Different teams opt for different strategies when it comes to managing their Git repositories. For some teams, the majority of their code is in one Git repository. Code is broken up into components, which each live in their own root-level folder. Really large components, especially older components, may be made up of multiple subcomponents. Those subcomponents get separate sub-folders within the parent component.
Some teams also manage a few adjunct repositories. For instance, the build and release agent and tasks, the VS Code extension, and more are developed in the open on GitHub. Configuration changes are checked into a separate repository. A handful of other packages that the team depends on come from other places and are consumed via NuGet.
Mono repo or multi-repo with Git
While some teams elect to have a single monolithic repository (the mono-repo), other products at Microsoft use a multi-repo approach. Skype, for instance, has hundreds of small repositories that get stitched together in various combinations to create their many different clients, services, and tools. Especially for teams embracing microservices, multi-repo can be the right approach. Usually older products that began as a monolith end up finding a mono-repo approach to be the easiest transition to Git, and their code organization reflects that.
Git branch structure and policies
Our release flow lets us keep main buildable at all times and work from short-lived topic branches. When we're ready to ship, whether that's a sprint or a major update, we start a new release branch off main. Release branches never merge back to main, so we require cherry-picking important changes.
In the diagram below, short-lived branches are shown in light blue and the release branches are shown in dark blue. One branch with a commit that needs cherry-picking is shown in red.
We use a couple of Git branching features to help enforce this structure and keep main clean. Branch policies prevent direct pushes to main. We require a successful build (including passing tests), signoff by the owners of any code that was touched, and a handful of external checks verifying corporate policies before a pull request can be completed.
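The completion rule described above (successful build, signoff by code owners, passing external checks) can be modeled as a simple predicate. The sketch below is illustrative only: the PullRequest class, its fields, and the check names are hypothetical and do not correspond to any real Azure DevOps API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PullRequest:
    """Hypothetical, simplified view of a pull request's policy state."""
    target: str
    build_succeeded: bool                      # build plus tests
    required_owners: Set[str]                  # owners of any code that was touched
    approvals: Set[str] = field(default_factory=set)
    external_checks: Dict[str, bool] = field(default_factory=dict)


def blocking_reasons(pr: PullRequest) -> List[str]:
    """List every policy that still blocks completion of the pull request."""
    reasons = []
    if not pr.build_succeeded:
        reasons.append("build or tests failed")
    missing = sorted(pr.required_owners - pr.approvals)
    if missing:
        reasons.append("missing owner sign-off: " + ", ".join(missing))
    failed = sorted(name for name, ok in pr.external_checks.items() if not ok)
    if failed:
        reasons.append("external checks failed: " + ", ".join(failed))
    return reasons


def can_complete(pr: PullRequest) -> bool:
    """A pull request can be completed only when nothing blocks it."""
    return not blocking_reasons(pr)
```

Returning the list of blocking reasons, rather than a bare boolean, mirrors how a policy gate typically tells the author what is still outstanding.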
We also like to keep our branch hierarchy tidy. We use permissions to block creation of branches at the root level of the hierarchy. Everyone can create branches in folders like users/, features/, and teams/. Only release managers have permission to create branches under releases/, and some automation tools have permission to the integrations/ folder.
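As a rough model of these folder permissions, the sketch below maps each branch folder to the roles allowed to create branches in it. The role names and the mapping structure are assumptions made for illustration; real permissions are configured on the Git server, not in application code.

```python
# Hypothetical role names; the folders come from the text above.
EVERYONE = {"developer", "release-manager", "automation"}

CREATE_PERMISSIONS = {
    "users": EVERYONE,
    "features": EVERYONE,
    "teams": EVERYONE,
    "releases": {"release-manager"},
    "integrations": {"automation"},
}


def can_create_branch(role: str, branch: str) -> bool:
    """Return True if the role may create the given branch."""
    folder, sep, rest = branch.partition("/")
    if not sep or not rest:
        # Branches at the root of the hierarchy are blocked for everyone.
        return False
    return role in CREATE_PERMISSIONS.get(folder, set())
```

For example, can_create_branch("developer", "releases/M129") is False, while the same call with the role "release-manager" is True.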
Working in the Git repository
Within this structure, how do engineers actually get their daily work done? Obviously the environment's going to vary heavily by team and by individual. Some people like the command line, others like Visual Studio, and others work on different platforms. But the structures and policies in place on our repository ensure a solid and consistent foundation. Let's walk through a handful of common tasks.
Git workflow to build a new feature
Our first stop has to be building a new feature. That's the core of a software engineer's job, right? We'll skip past the non-Git parts like looking at telemetry data, coming up with a design and a spec, and even writing the actual code. Let's jump right into working with the repository.
The engineer starts by syncing to the latest commit on main. We keep main always buildable, so this is virtually guaranteed to be a good starting point. The developer checks out a new feature branch, makes code changes, commits, and pushes to the server. When our engineer starts a pull request, several interesting things happen.
Using Git branch policy
Upon the creation of a pull request, automated systems check that the new code builds, hasn't broken anything, and hasn't violated any policies related to security, compliance, and so on. This doesn't block other work from happening in parallel. Most teams have configured integration with Microsoft Teams, which announces the new pull request to the engineer's colleagues. The owners of any code touched are automatically added as reviewers. We make liberal use of optional reviewers for code that many people touch, like REST client generation and shared controls, as a way to get expert eyes on those changes.
Once the people and the automation are satisfied, our engineer completes the pull request. If there's a merge conflict, the engineer is given instructions on how to sync to the conflict, fix it, and re-push the changes. The automation all runs again on the fixed code, but humans don't have to sign off again.
The branch is merged into main, and the new code will deploy in the next sprint or major release. Importantly, that doesn't mean the new feature will show up right away. We decouple the deployment and exposure of new features using feature flags. This means even if the feature needs a little more time before it's ready to show off, it's safe to go to main if the product builds and deploys. Once in main, the code ends up in an official build, where it's again tested, confirmed to meet policy, and digitally signed.
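To illustrate how a feature flag separates deployment from exposure, here is a minimal sketch. The FeatureFlags class, the flag name "new-charts", and the render_dashboard function are all hypothetical; a production system would read flags from a configuration service rather than an in-memory set.

```python
class FeatureFlags:
    """In-memory flag store, standing in for a real configuration service."""

    def __init__(self, enabled=None):
        self._enabled = set(enabled or ())

    def is_enabled(self, name: str) -> bool:
        return name in self._enabled

    def enable(self, name: str) -> None:
        self._enabled.add(name)


def render_dashboard(flags: FeatureFlags) -> list:
    """Return the widgets to show; the new widget ships dark until its flag is on."""
    widgets = ["build status", "work items"]
    if flags.is_enabled("new-charts"):
        # This code is already deployed, but users only see it once the flag is enabled.
        widgets.append("new charts")
    return widgets
```

With the flag off, the new code path is merged and deployed but never reached, which is what makes it safe to merge unfinished work into main.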
Git facilitates our shift left
Working this way with Git gives us a number of benefits. First, we work out of a single main, virtually eliminating merge debt. Second, the pull request flow gives us a common point to force testing, code review, and error detection early in the pipeline. This helps us shorten the feedback cycle to developers since errors are usually detected in minutes, not hours or days. Also, it gives us confidence when we refactor since all changes are tested all the time.
Currently, a product with 200+ pull requests may produce 300+ continuous integration builds per day. Together, that amounts to 500+ test runs every 24 hours, a level that would have been unthinkable without this workflow.
Releases at sprint milestones
At the end of each sprint, we create a deployment branch from the main branch. For example, at the end of sprint 129, we create a new branch releases/M129. We then put the sprint 129 branch into production.
Once we've branched to our deployment branch, the main branch remains open for developers to merge changes. These changes, of course, do not get deployed to production. They'll be deployed within the next three weeks as part of the next sprint deployment.
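The release branch naming used here (releases/M129 for sprint 129) is easy to derive mechanically. The helpers below are a sketch under that one assumption; the function names are hypothetical.

```python
import re

# Matches names such as releases/M129 and captures the sprint number.
RELEASE_BRANCH = re.compile(r"^releases/M(\d+)$")


def release_branch(sprint: int) -> str:
    """Return the release branch name for a sprint, e.g. releases/M129."""
    return f"releases/M{sprint}"


def sprint_of(branch: str) -> int:
    """Extract the sprint number from a release branch name."""
    match = RELEASE_BRANCH.match(branch)
    if match is None:
        raise ValueError(f"not a release branch: {branch!r}")
    return int(match.group(1))


def next_release_branch(current: str) -> str:
    """Return the release branch that follows the given one."""
    return release_branch(sprint_of(current) + 1)
```

Release tooling could use helpers like these to work out which branch to cut from main at the end of each sprint.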
Releasing hotfixes
Sometimes changes need to go to production more quickly. We generally won't add big new features in the middle of a sprint, but occasionally we want to bring a bug fix in quickly to unblock users. Sometimes these fixes are minor, such as typos. However, sometimes we have a bug that causes an availability issue, which we call a live site incident.
When this happens, we start with our normal workflow. We create a branch from main, get it code reviewed, and complete the pull request to merge it. We always start by making the change in main first. This allows us to create the fix quickly and validate it locally without having to switch to the release branch locally.
More importantly, by following this process, we're guaranteed that our change goes into main. This is critical for us. If we were to fix a bug in the release branch first, and accidentally forget to bring the change back to main, we would have a recurrence of the bug during the next deployment when we create our sprint 130 release branch from main.
It's particularly easy to forget to do this during the confusion and stress that can arise during an outage. So by always bringing our changes to main first, we know that we'll always have our changes in both the main branch and our release branch.
Git functionality enables this workflow. From the Pull Request page, you can cherry-pick a pull request onto a different branch. To bring changes immediately into production, once we have merged the pull request into main, we cherry-pick the change into the release branch. This creates a new pull request that targets the release branch, backporting the contents that were just merged into main.
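The reason for fixing main first can be shown with a toy model in which each branch is just a list of change IDs. This is a deliberately simplified sketch: the dictionary-based repo, the function names, and the change IDs are all hypothetical, and no real Git operations are performed.

```python
def create_branch(repo: dict, name: str, source: str) -> None:
    """Create a branch as a copy of the source branch's history."""
    repo[name] = list(repo[source])


def hotfix(repo: dict, change: str, release: str) -> None:
    """Apply a fix main-first, then backport it to the release branch."""
    repo["main"].append(change)    # 1. normal pull request into main
    repo[release].append(change)   # 2. cherry-pick pull request into the release branch


# Sprint 129 ships, work continues on main, then a hotfix is needed.
repo = {"main": ["a", "b"]}
create_branch(repo, "releases/M129", "main")
repo["main"].append("c")
hotfix(repo, "fix-1", "releases/M129")

# The next release branch is cut from main, so it inherits the fix.
create_branch(repo, "releases/M130", "main")
```

Because the fix lands in main before it reaches the release branch, any later branch cut from main contains it and the bug cannot recur in the next deployment.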
By opening a new pull request, we get traceability and reliability from branch policies. Using cherry-pick functionality allows us to do it quickly. We don't need to download the release branch to a local computer to cherry-pick the changes. It's all handled efficiently on the server. And if we need to fix merge conflicts or make minor adjustments because of differences between the two branches, we can do that on the server, too. We can edit changes directly from the browser-based text editor or via Pull Request Merge Conflict Extension for a more advanced experience.
Once we have a pull request targeting our release branch, we'll code review it again, evaluate the branch policies, and test it. Once it's merged, it will get deployed to our first ring of servers in minutes. From there, we'll progressively deploy it to more accounts using deployment rings. As more users are exposed to the changes, we'll monitor success and ensure that our change has fixed the bug while not introducing any new deficiencies or slowdowns as the fix is deployed to the rest of our data centers.
Moving on to the next sprint
After three weeks, we'll finish adding features to sprint 130, and we'll be ready to deploy those changes. To deploy, we'll create the new release branch, releases/M130, from main, and deploy that.
At this point, we'll actually have two branches in production. Since we use a ring-based deployment to bring changes to production safely, our fast ring will get the sprint 130 changes, and our slow ring servers will stay on sprint 129 while the new changes are validated in production.
This raises an interesting problem. If we need to hotfix a change in the middle of a deployment, we may need to hotfix two different releases: the sprint 129 release and the sprint 130 release. In these cases, we'll port the hotfix to both release branches and deploy both release branches. The 130 branch will redeploy with the hotfix to the rings that have already been upgraded. The 129 branch will redeploy with the hotfix to the outer rings that haven't been upgraded to next sprint's version yet.
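Deciding which release branches need a hotfix during a rollout comes down to checking which sprint versions are still serving at least one ring. The sketch below assumes a hypothetical mapping from ring name to deployed sprint number; the ring names and function names are illustrative only.

```python
from typing import Dict, List


def hotfix_targets(ring_versions: Dict[str, int]) -> List[str]:
    """Return the release branches that still serve at least one ring."""
    sprints = sorted(set(ring_versions.values()))
    return [f"releases/M{sprint}" for sprint in sprints]


def rings_to_redeploy(ring_versions: Dict[str, int], sprint: int) -> List[str]:
    """Return the rings currently running the given sprint's release."""
    return [ring for ring, version in ring_versions.items() if version == sprint]


# Mid-rollout: the inner rings are on sprint 130, the outer rings still on 129.
rings = {"ring0": 130, "ring1": 130, "ring2": 129, "ring3": 129}
targets = hotfix_targets(rings)
```

In this mid-rollout state both release branches are targets. Once every ring reports sprint 130, only releases/M130 remains and the older branch can be abandoned.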
Once all the rings have been deployed, our old branch from sprint 129 is completely abandoned. We'll never need it again, since we were very careful to ensure that any changes that we brought into the sprint 129 branch as a hotfix were also made in main. So those changes will also be in the releases/M130 branch that we create.
Summary
Our release flow model is at the heart of how Microsoft develops with DevOps. It allows us to use a simple, trunk-based branching strategy for our online service. But instead of keeping our developers stuck in a deployment queue, waiting to be able to merge their changes, our developers can keep working.
It also enables us to deploy new features across all our Azure data centers at a regular cadence, and despite the size of our codebases and the number of developers working in them, we can bring hotfixes into production quickly and efficiently.