How Microsoft Operates One of the World’s Largest Cloud Infrastructures

Data Center Services Team

Microsoft has delivered online services since 1994 with the launch of MSN. Today, our engineering and operations teams build, manage, and secure a $15 billion cloud-scale infrastructure that powers over 200 services for consumers and businesses 24x7x365 through globally distributed data centers, networks, security mechanisms, software applications, and tools. Our cloud supports more than 1 billion customers and 20 million businesses in 76 markets. With an annual investment of $9.5 billion in research and development, we can continue to accelerate innovation in our operational processes to maximize reliability, performance, and efficiency as our capacity grows.


Each year, thousands of enterprise customers and partners request meetings with our Global Foundation Services team to learn how we manage Microsoft's cloud infrastructure. To help explain our approach, today our team is sharing a new strategy paper called Cloud Operations Excellence & Reliability, along with two new videos that address the top issues we focus on to keep our sites up, keep costs down, and accelerate efficiency and sustainability across our large portfolio.


In the two new videos, Dayne Sampson, corporate vice president, provides an overview of Microsoft's approach to infrastructure operations, and Rick Bakken, senior director, walks through the company's operational approach in a presentation-style video.

The strategy paper is written by our data center services subject matter experts and offers details on the processes and procedures Microsoft uses to maintain 24x7x365 availability for its cloud. Here are a few highlights:


  • Site up, costs down: Our infrastructure team keeps a relentless focus on these two themes in the planning, management, and growth of our operations. While software has taken a central role in maintaining service availability and keeping our cloud up and running, we still anticipate and prepare for issues. This preparation and understanding help us compartmentalize and address failures more quickly. To keep costs down, we add new data center capacity only when and where customer demand requires it. And when adding new capacity, we continually learn from and improve our data center designs, enabling us to bring new capacity online faster.
  • Network reliability: A reliable network is critical to Microsoft's online services. To deliver this, we operate one of the world's largest fiber networks, providing over 3.5 terabits per second of capacity to over 2,000 networks with 99.95 percent availability. This network provides multiple paths to many providers, allowing instantaneous re-routes around internet failures to maintain high reliability. It has sufficient capacity to maintain performance during large-scale network interruptions.
  • Microsoft Operations Centers: Quick response is critical in ensuring service delivery to our customers. Through multiple global Microsoft Operations Centers working around the clock, our teams of engineers work to ensure continual online services availability. Using an incident management model, staff in these centers can quickly identify issues and resolutions, while also driving engagement for higher severity issues. The model establishes clear accountabilities and describes parameters for handling incidents and outages.

With new services and applications launching from Microsoft regularly, we're making thoughtful investments to meet customers' needs for greater availability, improved performance, increased security, and lower costs. By sharing best practices through interactive videos, white papers, and strategy papers, we hope to increase the dialogue within the industry as we all work to advance cloud operational reliability. Please visit our operational excellence page to explore all the new content.


Data Center Services Team