Cloud services you can trust: Office 365 availability

08/08/2013

Rajesh Jha, Corporate Vice President, Office

This is a guest post from Office Corporate Vice President Rajesh Jha, cross-posted from the Office blog.

Office 365 availability

Since launching Office 365 two years ago, we have continued to invest deeply in our infrastructure to ensure a highly available service. While detailed availability information has been available to our current customers, today we're making this information available to all customers considering Office 365. We measure availability as the number of minutes that the Office 365 service is available in a calendar month as a percentage of the total number of minutes in that month. We call this measure of availability the uptime number. Within this calculation we include our business, government and education services. The worldwide uptime numbers for Office 365 for the last four quarters, beginning July 2012 and ending June 2013, were 99.98%, 99.97%, 99.94% and 99.97%, respectively. Going forward, we will disclose uptime numbers on a quarterly basis on the Office 365 Trust Center.
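To make the definition of the uptime number concrete, here is a minimal sketch in Python of the calculation as described: available minutes divided by total minutes in the calendar month. The function names and the 13 minutes of downtime are hypothetical, chosen only for illustration; they are not reported figures.

```python
import calendar


def minutes_in_month(year: int, month: int) -> int:
    """Total number of minutes in the given calendar month."""
    days = calendar.monthrange(year, month)[1]
    return days * 24 * 60


def uptime_percent(available_minutes: float, total_minutes: float) -> float:
    """Available minutes as a percentage of all minutes in the month."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * available_minutes / total_minutes


# Hypothetical example: 13 minutes of downtime in June 2013 (a 30-day month).
total = minutes_in_month(2013, 6)   # 43200 minutes
downtime = 13                       # illustrative value, not a reported figure
print(round(uptime_percent(total - downtime, total), 2))  # 99.97
```

At this scale, a difference of 0.01 percentage points corresponds to only a few minutes in a month, which is why the figures are quoted to two decimal places.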

Here are a few more details about the uptime number:

The uptime number includes Exchange, SharePoint, Lync and Office Web Apps, weighted by the number of people using each of these services. Customers use these services together, so all of them are taken into account when calculating uptime.
This uptime number applies to Office 365 for business, education and government. We do not include consumer services in this calculation.
Office 365 ProPlus is an integral part of our service offering but is not included in this calculation of uptime since it largely runs on the users' devices.
Individual customers may experience higher or lower uptime percentages compared to the global uptime numbers depending on location and usage patterns.
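One way to read "weighted on the number of people using each of these services" is as a user-weighted average of per-service uptime. The sketch below assumes exactly that; the precise weighting formula is not spelled out here, and the per-service uptime figures and user counts are invented for illustration.

```python
def weighted_uptime(per_service: dict[str, tuple[float, int]]) -> float:
    """User-weighted mean of per-service uptime percentages.

    per_service maps a service name to (uptime_percent, user_count).
    Assumption: a plain weighted arithmetic mean.
    """
    total_users = sum(users for _, users in per_service.values())
    if total_users == 0:
        raise ValueError("at least one service must have users")
    weighted_sum = sum(uptime * users for uptime, users in per_service.values())
    return weighted_sum / total_users


# Hypothetical figures, for illustration only.
services = {
    "Exchange": (99.98, 500),
    "SharePoint": (99.95, 300),
    "Lync": (99.99, 150),
    "Office Web Apps": (99.90, 50),
}
print(round(weighted_uptime(services), 4))  # 99.9685
```

Under this reading, a service with few users moves the overall number only slightly, even if its own uptime is noticeably lower.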

As a commitment to running a highly available service, we have a Service Level Agreement of 99.9% that is financially backed.
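For a sense of scale, the following sketch converts an availability percentage into the downtime it permits over a period. It assumes the 99.9% figure is evaluated against the same monthly minutes-based measure described earlier; the function name is hypothetical.

```python
def allowed_downtime_minutes(sla_percent: float, total_minutes: int) -> float:
    """Minutes of downtime permitted while still meeting sla_percent."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    return total_minutes * (100.0 - sla_percent) / 100.0


minutes_30_day_month = 30 * 24 * 60  # 43200 minutes
budget = allowed_downtime_minutes(99.9, minutes_30_day_month)
print(round(budget, 1))  # 43.2
```

In other words, 99.9% over a 30-day month leaves roughly 43 minutes of downtime.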

Availability design principles

We have been building enterprise-class solutions for decades. In addition, Microsoft runs a number of cloud services, including Office 365, Windows Azure, CRM Online, Outlook.com, SkyDrive, Bing, Skype and Xbox Live, to name a few. We benefit from this diversity of services, applying best practices from each service across the others to improve both the design of the software and our operational processes.

Below are some examples of best practices applied in design and operational processes for Office 365.

Redundancy. Redundancy at every layer--physical, data and functional:

We build physical redundancy at the disk/card level within servers, the server level within a datacenter and the service level across geographically separate datacenters to protect against failures. Each datacenter has facilities and power redundancy. We have multiple datacenters serving every region.
To build redundancy at the data level, we constantly replicate data across geographically separate datacenters. Our design goal is to maintain multiple copies of data, whether in transit or at rest, along with failover capabilities that enable rapid recovery.
In addition to physical and data redundancy, we build Office clients to provide functional redundancy, which is one of our core strengths: offline functionality lets you stay productive when there is no network connectivity.

Resiliency. Active load balancing and constant recovery testing across failure domains:

We actively balance load in an automated manner to provide end users the best possible experience. These mechanisms also prioritize dynamically, performing low-priority tasks during low-activity periods and deferring them during high load.
We have both automated and manual failover to healthy resources in response to hardware or software failures and monitoring alerts.
We routinely perform recovery across failure domains to ensure readiness for circumstances that require failover.
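The dynamic prioritization described above can be pictured with a small sketch. This is not the Office 365 implementation; it is an illustrative toy in which the `Task` type, the `select_runnable` function and the 0.8 load threshold are all hypothetical. Low-priority work is deferred whenever the current load is at or above the threshold and runs otherwise.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    low_priority: bool


def select_runnable(tasks, current_load, high_load_threshold=0.8):
    """Split tasks into (run_now, deferred) based on the current load.

    current_load is a fraction of capacity in [0, 1]. Low-priority tasks are
    deferred while load is at or above high_load_threshold (assumed value).
    """
    under_high_load = current_load >= high_load_threshold
    run_now, deferred = [], []
    for task in tasks:
        if task.low_priority and under_high_load:
            deferred.append(task)
        else:
            run_now.append(task)
    return run_now, deferred


tasks = [Task("serve mailbox request", False), Task("rebuild search index", True)]
run_now, deferred = select_runnable(tasks, current_load=0.95)
print([t.name for t in run_now])   # ['serve mailbox request']
print([t.name for t in deferred])  # ['rebuild search index']
```

With a load of, say, 0.3, both tasks would be returned in `run_now` and nothing would be deferred.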

Distributed Services. Functionally distributed component services:

The component services in Office 365, such as Exchange, SharePoint, Lync and Office Web Apps, are functionally distributed, ensuring that the scope and impact of a failure in one area is limited to that area alone and does not affect others.
We replicate directory data across these component services so that if one service is experiencing an issue, users are able to log in and use other services seamlessly.
Our operations and deployment teams benefit from the distributed nature of our service, simplifying all aspects of maintenance and deployment, diagnostics, repair and recovery.

Monitoring. Extensive monitoring, recovery and diagnostic tools:

Our internal monitoring systems continuously monitor the service for any failure and are built to drive automated recovery of the service.
Our systems analyze any deviations in service behavior to alert on-call engineers to take proactive measures.
We also have Outside-In monitoring constantly executing from multiple locations around the world, both from trusted third-party services (for independent SLA verification) and from our own worldwide datacenters, to raise alerts.
For diagnostics, we have extensive logging, auditing, and tracing. Granular tracing and monitoring help us isolate issues to root cause.

Simplification. Reduced complexity drives predictability:

We use standardized components wherever possible. This leads to fewer deployment and issue isolation complexities as well as predictable failures and recovery.
We use standardized processes wherever possible. The focus is not only on automation but also on making sure that critical processes are repeated and repeatable.
We have architected the software components to be loosely coupled so that their deployment and ongoing health don't require complex orchestration.
Our change management goes through progressive, staged, instrumented rings of scope and validation before being deployed worldwide.

Human back-up. 24/7 on-call support:

While we have automated recovery actions where possible, we also have a team of on-call professionals standing by 24/7 to support you. This team includes support engineers, product developers, program managers, product managers and senior leadership.
With an entire team on call, we have the ability to provide rapid response and information collection towards problem resolution.
Our on-call professionals, while providing back-up, also improve the automated systems every time they are called to help.

Continuous learning

We understand that there will be times when you may experience service interruptions. We do a thorough post-incident review every time an incident occurs, regardless of the magnitude of impact. A post-incident review consists of an analysis of what happened, how we responded and how we can prevent similar incidents in the future. In the interest of transparency and accountability, we share the post-incident review for any major service incident that affected your organization. As a large enterprise, we also "eat our own dogfood," i.e., use our own pre-production service to conduct day-to-day business here at Microsoft. Continuous improvement is key to providing a highly available, world-class service.

Consistent communication

Transparency requires consistent communication, especially when you are using online productivity services to conduct your business. We have a number of communication channels, such as email, RSS feeds and the Service Health Dashboard. As an Office 365 customer, you get a detailed view into the availability of services that are relevant to your organization. The Office 365 Service Health Dashboard is your window into the current status of your services and your licenses. We continue to drive improvements into the Service Health Dashboard, including tracking the timeliness of updates, to ensure that you have full insight into the health of your services.

We also have some exciting new tools to improve your ability to stay up to date with the service. Last week we released a new feature in the administration portal called "Message Center." Message Center is a central hub for service communications, tenant reporting and actions required by administrators. Also, by the end of this year, administrators can expect a new mobile app that will provide service health information as well as other communications regarding their service.

Running a comprehensive and evolving service at ever-increasing scale is a challenge, and there will be service interruptions despite our efforts. We want to assure you that we are continually learning and are relentless in our commitment to provide you with a reliable, highly available service that meets your expectations. Service continuity is more than an engineering principle; it is a commitment to customers in our SLA and one of the key pillars of the Office 365 Trust Center (the other four pillars being Privacy, Security, Compliance and Transparency). This public disclosure of Office 365 uptime is evidence of our ongoing commitment to both Service Continuity and Transparency.

- Rajesh Jha, Corporate Vice President, Office