Operational Excellence is Essential for a Reliable Cloud Infrastructure

Attributed to: Christian Belady, General Manager of Data Center Services, and Vijay Gill, Senior Director of Network Engineering

When Microsoft delivered its first online service back in 1994, the cloud was still a relatively undefined, evolving space. Today, Microsoft's cloud supports more than 1 billion customers and 20 million businesses in 76 countries, and it is still growing. To support the increasing array of enterprise-level online services, Microsoft is investing heavily in a global cloud infrastructure that delivers security, reliability, scalability, efficiency, and performance. A key concern among customers is that their services remain persistently available, and that their cloud provider has the technology, policies, and processes in place to maximize uptime.

Disruptive events such as natural disasters, power grid outages, and human error are a fact of life, so it is critical that our cloud infrastructure is architected for resiliency, our processes result in rapid problem resolution with minimal customer impact, and we maintain a culture of service diligence and continuous innovation to identify and eliminate systemic weaknesses.

Laser-sharp focus on operational excellence helps Microsoft ensure reliability, availability, and security 24x7x365.  Operational excellence can be complicated, and our customers often want to know what it encompasses.

An effective cloud infrastructure strategy rests on two things: keeping our sites up and driving costs down. We architect a resilient cloud infrastructure with geo-redundant backup for recoverability and data integrity. The company invests more than $9 billion annually in research and development on our cloud, services, and software to ensure a strong innovation roadmap.

The Microsoft cloud is also founded on robust security policies and procedures that differentiate data (consumer vs. enterprise), separating private company information from commercial use. We maintain a fully developed Service Management framework called the Microsoft Operations Framework, which guides and formalizes our compliance with international standards such as ISO 20000 for Service Management, ISO 27001 for Information Security, and BS 25999 for Business Continuity. In addition, we hold third-party, process-based attestations and certifications including SOC (SSAE 16/ISAE 3402), FISMA, Sarbanes-Oxley, and HIPAA/HITECH.

Our continuous growth reflects judicious data center designs that add capacity and reduce attendant costs, maximizing compute capacity with the fewest servers while maintaining workload isolation. We partner with leading hardware manufacturers to right-size the thousands of servers, routers, and other equipment within our data centers, balancing the degree of physical device fault tolerance against application fault tolerance.

We find the right balance of hardware and software resiliency to provide the best availability for the lowest possible cost. For example, our network, one of the largest and best-connected in the world, includes multiple, physically diverse connections into our data centers, and we maintain sufficient capacity to handle large-scale network interruptions without degradation of performance.
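To make the availability-versus-cost tradeoff concrete, here is a minimal sketch of one textbook way to reason about it. It is not a description of our actual planning model. It assumes component failures are independent, so that n redundant copies, each with availability a, give a combined availability of 1 - (1 - a)^n. The component names, availability figures, and unit costs are all hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - component_availability) ** copies


def cheapest_config(target, options, max_copies=10):
    """Return (name, copies, total_cost) of the cheapest option meeting `target`.

    `options` is a list of (name, availability, unit_cost) tuples.
    Returns None if no option reaches the target within `max_copies`.
    """
    best = None
    for name, availability, unit_cost in options:
        for copies in range(1, max_copies + 1):
            if parallel_availability(availability, copies) >= target:
                total = copies * unit_cost
                if best is None or total < best[2]:
                    best = (name, copies, total)
                break  # more copies of this option only cost more
    return best


# Hypothetical figures, for illustration only.
options = [
    ("commodity", 0.99, 1.0),   # cheaper, less fault-tolerant hardware
    ("hardened", 0.9999, 5.0),  # pricier, more fault-tolerant hardware
]
result = cheapest_config(0.99999, options)
print(result)
```

Under these made-up numbers, three redundant commodity components meet the target at lower cost than two hardened ones, which is the kind of balance between device-level and application-level fault tolerance described above.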

Sophisticated service instrumentation and monitoring integrate at the deepest levels with each component, giving us visibility into the data center, network backbone, internet exchanges, and beyond. This helps us spot, diagnose, and manage the cause of any disruption and resolve it quickly.

Our globally distributed Microsoft Operations Center team provides technical troubleshooting expertise, round-the-clock staffing, failover capabilities, and the resources needed to triage, mitigate, and escalate issues as they unfold in real time. Managing releases and changes through formalized processes helps remove the potential for human error, leverages standardization, and protects data vigilantly.


One of the world's largest fiber backbones powers our data centers, providing more than 3.5 terabits per second of capacity to more than 1,200 networks with robust availability. It gives us the ability to reroute instantaneously around internet failures and the capacity to withstand significant network interruptions.

Finally, attention to environmental issues and a relentless focus on energy efficiency form the last component of Operational Excellence. The Microsoft data center operations team continually seeks to drive more efficient power use and cooling in our data centers. Our latest modular data centers use about 50% less energy than those from three years ago. Our latest modular facilities, which pioneer the use of fresh-air cooling and ultra-efficient water use, consume only 1% of the water used by traditional data centers in the industry. They also use recyclable materials in construction and build in sustainability measures that ensure we are constantly gaining in efficiency and sustainability. (In fact, we were honored that the United Nations Environment Programme relied on our modular data center designs for the IT infrastructure at their new carbon-neutral headquarters complex.)

We also drive efficiency and sustainability in how we operate our data centers. Ours are among the most monitored and measured data centers in the world, which informs more efficient operations and identifies areas for future research. At the request of the US Environmental Protection Agency, we shared how we do this at their industry stakeholders meeting on efficiency, as an example of how to do it right. In addition, by eliminating unnecessary components, using higher-efficiency power supplies and voltage converters, and bounding the expandability of server platforms, we achieve significant power savings. We look at specific measures such as processor performance per dollar and per watt to determine optimum tradeoffs in processor selection. We also operate our servers across a wider temperature range and use free-air cooling and water economization to improve efficiency. In San Antonio we have been using recycled waste water to cool the facility since 2008.
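As a rough sketch of how a performance per dollar, per watt comparison might be computed, the snippet below ranks a few candidate processors. The way the two measures are combined here (performance divided by the product of price and power draw) is an assumption made for illustration, not our actual selection method, and every processor name and figure is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    name: str
    performance: float  # benchmark score, arbitrary units (hypothetical)
    price_usd: float    # unit price in dollars (hypothetical)
    watts: float        # typical power draw in watts (hypothetical)

    @property
    def perf_per_dollar(self) -> float:
        return self.performance / self.price_usd

    @property
    def perf_per_watt(self) -> float:
        return self.performance / self.watts

    @property
    def perf_per_dollar_per_watt(self) -> float:
        # One possible combined figure of merit; other weightings are equally valid.
        return self.performance / (self.price_usd * self.watts)


# Illustrative candidates only; these are not real parts or measurements.
candidates = [
    Processor("A", performance=100.0, price_usd=400.0, watts=100.0),
    Processor("B", performance=150.0, price_usd=900.0, watts=120.0),
    Processor("C", performance=80.0, price_usd=250.0, watts=65.0),
]

# Rank from best to worst on the combined measure.
ranked = sorted(candidates, key=lambda p: p.perf_per_dollar_per_watt, reverse=True)
best = ranked[0]

for p in ranked:
    print(f"{p.name}: {p.perf_per_dollar:.3f} perf/$, {p.perf_per_watt:.3f} perf/W")
```

With these made-up inputs, the processor with the highest raw performance (B) ranks last once cost and power are taken into account, which is why a combined measure matters more than raw speed at data center scale.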

With all these opportunities for efficiency, we were not surprised by the results when we provided data to Accenture and WSP Environmental for a study on the environmental impact of cloud computing. They found that when organizations move their Microsoft business applications to our cloud, there is a net energy saving per user of at least 30%. For small businesses the result can be even more dramatic, with potential carbon savings of up to 90% per user.

As part of our commitment not only to driving greater efficiency but also to reducing the impact of our data centers over the long run, later this week we will be sharing some insights into an exciting alternative-energy research project we have been working on. While we have purchased, and continue to purchase, green energy at many of our locations around the world, we fundamentally believe there is potential to redefine how we and the industry power our data centers.

A few years ago, we spoke about the concept of looking at data as a form of energy and the concept of data plants. Last July, in a blog post titled The Disappearing Data Center, we expanded the concept further. Through massive integration of power plants and data centers, we felt that there could be huge efficiency gains from eliminating the transmission lines, substations, and transformers (as well as the associated transmission losses) that we see in today's power distribution ecosystem. With these data plants we distribute data (an energy form) over a network (an optical grid), providing the next generation of energy distribution. Looking at it this way, we are essentially taking another step in refining the energy being distributed. The point is this: if we look at data as a form of energy, how does that change what we are doing today? When we talk about the disappearing data center, what we really mean is that the data center as we know it will disappear through integration, driving unprecedented efficiency gains, and this can only be done at the scale of the cloud. The figure below shows this evolution, which takes sustainability to new levels and brings side benefits in reliability and the ability to use waste gases.

Figure: The evolution of data into energy.

We are committed to the long-term sustainability of our industry and believe this is where the industry should focus, rather than on short-term manipulation of carbon PPAs. We have already invested millions in research in this area and have filed many patents, including one on the Data Plant. We will provide more details on these innovations later this week on our team blog.

As we continue to experiment and share our findings, we hope to accelerate our own, and our industry's, continued move toward increasing positive computational impact while reducing operational impact.

We have come a long way since we built our first data center in 1989 (on our Redmond, WA campus) and launched our first cloud service, MSN, in 1994. We continue to make positive strides toward even greater reliability, availability, security, and environmental sustainability every day. Our team is passionately committed not only to improving our own cloud infrastructure designs and the operational practices of our facilities, but also to sharing those with other stakeholders in the data center industry.

//cb and vg