Microsoft Cloud-Scale Data Center designs
David Gauthier, Director
Data Center Architecture & Design
In a post last month, I shared some new details about how Microsoft focuses on the use of software to engineer resiliency into our global network of cloud-scale data centers. I explained the fundamental shift we've made to viewing data centers through a systems integrator model, where we look at every aspect of the physical and operational environment as a component of an integrated system to drive continual improvement in performance, efficiency and service availability. By sharing the responsibility for service availability - from development to data center - we're breaking down the silos that traditionally separate IT engineering activities, allowing Microsoft to define the data center environment as a converged ecosystem delivering improved reliability of our services and more satisfied customers.
As more and more availability scenarios are solved in software, we've re-examined how data centers are designed and operated. Today, I'd like to share some examples of how software resiliency has influenced some significant changes in Microsoft's production data centers.
Chicago, Illinois (opened: 2009)
In 2007, we started designing our 700,000 square foot Chicago facility - a large portion of which would be dedicated to services running on resilient, cloud-scale software. These services would operate across geographies in a redundant fashion, supporting multi-way, active/active applications that can transfer workloads within the data center or to another facility. This environment would need to scale out rapidly, but more importantly it would need to compartmentalize failures in a standard and predictable way. To satisfy both of these goals, we implemented a containerized modular data center topology.
Containers afforded us a vehicle for deploying unprecedented quantities of servers in a short period of time - each container houses thousands of machines and is installed 'plug and play' in as little as one day. The servers in these containers are identical, fungible resources that act as a resource pool on top of which workloads are placed.
The logical grouping of IT hardware with its supporting infrastructure allows us to better manage and automate workload allocation across the data center through the application layer of the stack. But to place workloads effectively, you need to have confidence in the relationships between the infrastructure systems supporting the IT equipment.
In the electrical and mechanical design of this data center, we considered each container as a discrete failure domain and modeled the availability of power and cooling with the expectation that maintenance events and unplanned outages would occur in the environment. While the traditional data center designer focuses on maximizing Mean Time Between Failures (MTBF), we made design choices based on Mean Time to Recover (MTTR) and the impact to workloads during downtime. This allowed us to remove a number of components from the design, resulting in significant savings not only in capital investment, but also in long-term operating costs. The first 'traditional' data center components to go were the dual cords to the servers and the diesel generators.
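To make the MTBF/MTTR trade-off concrete, here is a back-of-the-envelope sketch using the standard steady-state availability formula. The numbers are purely illustrative, not Microsoft's actual figures:

```python
# Availability depends on both how often you fail (MTBF) and how fast you
# recover (MTTR): availability = MTBF / (MTBF + MTTR). Shrinking recovery
# time can buy as much availability as preventing failures outright.
# Numbers below are illustrative only.
def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover (hours)."""
    return mtbf_h / (mtbf_h + mttr_h)

# Compare doubling MTBF vs. halving MTTR from a 10,000 h / 4 h baseline:
base        = availability(10_000, 4)
better_mtbf = availability(20_000, 4)   # harder: add redundant components
better_mttr = availability(10_000, 2)   # often cheaper: recover faster
# Both improvements yield the same availability, since only the
# MTBF/MTTR ratio matters.
```

This is why a design optimized for fast, predictable recovery can match one optimized for failure avoidance, at lower capital cost.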
Single corded containers. By treating a container (and the thousands of servers inside it) as a discrete failure domain and unit of capacity, we could forego the second power supply and redundant power source that most data centers deliver to every server. If a container is taken offline due to planned maintenance or an unplanned failure, it is completely isolated from the other containers, and workloads are transitioned to a functioning container while maintaining customers' service level agreements (SLAs).
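The failure-domain behavior described above can be sketched in a few lines. This is a hypothetical illustration - the `Container` and `drain` names and the greedy placement are my assumptions, not Microsoft's actual scheduler:

```python
# Hypothetical sketch: each container is a discrete failure domain. When one
# goes offline, its workloads are drained onto healthy peers in the same
# data center. Greedy most-free-first placement is an assumption for brevity.
from dataclasses import dataclass, field

@dataclass
class Container:
    name: str
    capacity: int                 # server slots in this container
    healthy: bool = True
    workloads: list = field(default_factory=list)

    @property
    def free(self) -> int:
        return self.capacity - len(self.workloads)

def drain(failed: Container, pool: list[Container]) -> bool:
    """Move every workload off a failed container onto healthy peers."""
    targets = [c for c in pool if c.healthy and c is not failed]
    for w in list(failed.workloads):
        dest = max(targets, key=lambda c: c.free, default=None)
        if dest is None or dest.free == 0:
            return False          # not enough spare capacity in this DC
        dest.workloads.append(w)
        failed.workloads.remove(w)
    return True

a = Container("C1", capacity=4, workloads=["web", "db"])
b = Container("C2", capacity=4, workloads=["cache"])
a.healthy = False                 # container taken offline for maintenance
assert drain(a, [a, b])           # workloads land on the healthy container
```

In a real system the scheduler would also respect placement constraints and SLAs; the point here is only that the container, not the server, is the unit of failure and recovery.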
No generators. Our Chicago data center has a remarkably robust connection to the electrical grid. Through deep analysis of our utility services, we determined that the probability of a complete loss of electricity to the facility was extremely low and that the types of events that could cause such an outage were very unlikely. The decision to forego diesel generators for our container bay floor was rooted in our capability to deeply integrate the facility with the software that runs on top of it. In the rare event of a total utility outage, the same software that can move a workload from one container to another can sense a broad, facility-level loss of utility and will redirect users and traffic to another data center with minimal service impact. In fact, one of these rare events did occur in 2011 as a result of an offsite lightning strike - workloads transferred to another facility as planned and, aside from a slight increase in network latency, the applications never missed a beat. The utility service returned in line with our MTTR expectations and workloads were transitioned back to Chicago.
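The key decision in that control path is telling an isolated container fault apart from a facility-wide utility loss, since only the latter should trigger a geo-failover. A toy sketch of that classification - the threshold and function name are my assumptions, not the actual control plane:

```python
# Illustrative sketch: classify an outage by how much of the facility lost
# utility power. An isolated container fault is handled inside the DC; a
# broad facility-level loss redirects users and traffic to a peer DC.
# The 0.8 threshold is an assumed value for illustration.
def classify_outage(power_ok: dict[str, bool], facility_threshold: float = 0.8) -> str:
    """power_ok maps container name -> True if utility power is present."""
    down = [c for c, ok in power_ok.items() if not ok]
    if not down:
        return "normal"
    if len(down) / len(power_ok) >= facility_threshold:
        return "facility"       # broad loss: shift traffic to another data center
    return "container"          # isolated fault: drain workloads within this DC

classify_outage({"C1": False, "C2": True, "C3": True})    # -> "container"
classify_outage({"C1": False, "C2": False, "C3": False})  # -> "facility"
```

A production system would debounce sensor readings and confirm via multiple signals before failing over, but the shape of the decision is the same.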
In 2010, Microsoft won the prestigious Green Enterprise IT Award from the Uptime Institute for the bold IT initiatives we utilized in Chicago. These integrated new design solutions led to greater efficiencies and carbon waste reductions.
Dublin, Ireland (opened: 2009)
Opening the same year as Chicago, our Dublin data center was designed with the same core set of performance criteria - efficiency in deployment, operation, energy and costs. While Chicago's containers were cooled via waterside economization, the mild year-round temperatures in Dublin allowed us to directly employ outside air as the primary (and now only) means of cooling the servers. The data center was designed as an evolution of a traditional colocation room, with rows of factory-integrated server racks deployed into pre-provisioned hot-aisle enclosure shells we call PODs. The PODs give us simplicity in deployment while optimizing airflow in a facility that breathes.
In the Dublin design, we planned for the rapid pace of evolution in data center technology by enabling phased construction and commissioning of the facility. Each successive phase of construction on this site has resulted in material improvements in energy efficiency, system capacity and cost performance. These efficiencies have primarily been the result of changing the inlet temperature for the servers and the mechanical means we have to trim the temperature. In the first phase data center, we employed adiabatic cooling - which evaporates water into the air to absorb heat - and an air bypassing feature to improve operational efficiency and maintain constant room temperature regardless of outdoor conditions. Out of an abundance of caution, we also installed a conventional direct expansion (DX) air conditioning system as a secondary means of protection in addition to the outside air cooling. As we got operationally comfortable with outside air, we removed the DX secondary system for our second phase expansion and relied only on the outside air and the adiabatic cooler. For the third phase, we increased our server inlet temperatures and further reduced the amount of time we used the trimming capability of the adiabatic cooler, resulting in a cooling system that uses less than 1% of the water consumed by a traditional data center. In the fourth phase, we mined years of operational runtime data to deeply understand the correlation between temperature and server component failures; an example of this research can be found here. With this information, we were able to further expand our inlet temperature conditions at the server and run the data center with no adiabatic cooling - outside air only and not a drop of water.
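The fourth-phase analysis boils down to asking whether failure rates actually climb with inlet temperature. A toy sketch of that kind of mining, with an assumed data shape (the real study would control for many more variables):

```python
# Toy sketch of the analysis described above: bucket server observations by
# inlet temperature and compute a failure rate per bucket, to test whether
# warmer inlet air measurably raises component failures. The data shape
# (temperature, failed-flag pairs) is an assumption for illustration.
from collections import defaultdict

def failure_rate_by_temp(records, bucket: int = 5):
    """records: iterable of (inlet_temp_c, failed: bool) per server observation."""
    observed = defaultdict(int)   # observations per temperature bucket
    failures = defaultdict(int)   # failures per temperature bucket
    for temp, failed in records:
        b = int(temp // bucket) * bucket      # e.g. 23 C -> the 20-25 C bucket
        observed[b] += 1
        failures[b] += failed
    return {b: failures[b] / observed[b] for b in sorted(observed)}

data = [(18, False), (19, False), (22, False), (23, True), (27, False), (28, False)]
rates = failure_rate_by_temp(data)
# rates == {15: 0.0, 20: 0.5, 25: 0.0}
```

If the per-bucket rates stay flat as temperature rises, the inlet setpoint can be raised and the adiabatic trim retired - which is exactly the operational conclusion described above.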
These examples of continuous improvement in design and operation reflect a key tenet of the cloud-scale data center: the data center must keep pace with technological improvement and the insight gained over time.
In 2010, the 6th Data Centres Europe Conference presented our Dublin data center team with the prestigious award for being the Best European Enterprise Data Center Facility.
Boydton, Virginia (opened: 2011)
The next leap in our cloud-scale infrastructure evolution was to integrate the best efficiency attributes of our Chicago and Dublin facilities. By combining the energy efficient performance of outside air cooling with the capacity and workload deployment agility of containers, we literally effected a transition from inside to outside.
Instead of housing containers or PODs within a conventional building, we designed self-contained, outdoor rated, air cooled data center modules. At Microsoft, we call these Pre-assembled Components or ITPACs. These ITPACs are installed, operated and serviced outside, exposed to rain, sleet, wind and snow. We've applied lessons learned from Dublin and Chicago, equipping these ITPACs with sensors and controls that enable operational runtime and system telemetry integration with our cloud services. Even in the often hot and humid environment of southern Virginia, the ITPACs do not use chillers or DX air conditioning systems, just outside air tempered by adiabatic coolers.
With the majority of the data center outdoors, it didn't make much sense to build a building to house the electrical equipment supporting the ITPACs. Instead, we organized the ITPACs along a central linear spine - a covered breezeway that facilitates personnel access while housing all of the electrical equipment delivering megawatts of power to the servers. This further increases the energy efficiency of the integrated system: the electrical equipment is cooled naturally, without the air conditioners or fans it would need inside a building.
The design and operation of Microsoft's cloud-scale data center footprint evolves over time, driving continuous improvements informed by runtime telemetry from operating one of the largest server and data center footprints on the planet. While each of the designs I've characterized above is a bit different, they all have one thing in common - a deliberate integration between the physical world of data centers and servers and the logical world of software.