Explore Microsoft’s Enterprise Resilience and Crisis Management (ERCM) program

Resilience is a critical component of Microsoft service availability, but even resilient services can be impacted by unexpected events. Microsoft's Enterprise Resilience and Crisis Management (ERCM) Program helps to ensure our online services are prepared to recover quickly from unexpected events.

ERCM team structure

The ERCM Program Office provides governance, oversight, and support for business continuity management (BCM) across Microsoft. The related Business Continuity Council, made up of senior management representatives from across Microsoft, is chartered to drive business continuity sponsorship, awareness, resource allocation, and program accountability within their respective business units. Together, these teams drive compliance with the Microsoft BCM framework across the enterprise.

Each Microsoft business unit is required to comply with the objectives of the Microsoft ERCM program. To support ERCM objectives, each business unit designates a representative, or Business Continuity Lead (BCL), to lead and coordinate ERCM implementation activities within their business unit. This representative serves as the primary point of contact for all continuity and resilience issues. Most BCLs rely on a team of individuals to help execute continuity programs within their business unit. These individuals are often referred to as Champs, Subject Matter Experts (SMEs), or Program Managers.

The ERCM Program Office maintains a database of all online services, including their upstream and downstream dependencies, which serves as a central repository for business continuity information across our online services. The database also records all relevant documentation, reviews, and testing dates. Service teams are automatically notified when their ERCM documentation or processes need to be updated or tested.
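As a rough illustration of the idea, the sketch below models a small registry that stores dependencies and testing dates and reports which services are due for testing. It is not Microsoft's implementation: the class names, field names, service names, and the 365-day interval are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class ServiceRecord:
    """One entry in a hypothetical business continuity registry."""
    name: str
    upstream: set = field(default_factory=set)  # names of services this one depends on
    last_tested: Optional[date] = None          # date of the most recent plan test, if any


class ContinuityRegistry:
    """Hypothetical central store of continuity information per service."""

    def __init__(self, review_interval_days: int = 365):
        # Assumed annual cadence; the real interval is not stated here.
        self.review_interval = timedelta(days=review_interval_days)
        self.services = {}

    def add(self, record: ServiceRecord) -> None:
        self.services[record.name] = record

    def downstream_of(self, name: str) -> list:
        """Services that list `name` as an upstream dependency."""
        return sorted(s.name for s in self.services.values() if name in s.upstream)

    def due_for_testing(self, today: date) -> list:
        """Services never tested, or last tested longer ago than the interval."""
        return sorted(
            s.name
            for s in self.services.values()
            if s.last_tested is None or today - s.last_tested > self.review_interval
        )


# Example data (invented service names).
registry = ContinuityRegistry()
registry.add(ServiceRecord("identity", last_tested=date(2024, 1, 10)))
registry.add(ServiceRecord("mail", upstream={"identity"}, last_tested=date(2023, 1, 5)))
registry.add(ServiceRecord("chat", upstream={"identity", "mail"}))

print(registry.downstream_of("identity"))           # ['chat', 'mail']
print(registry.due_for_testing(date(2024, 6, 1)))   # ['chat', 'mail']
```

In this sketch, a notification step would simply iterate over the result of `due_for_testing` and alert the owning team; downstream dependencies are derived from the upstream lists rather than stored twice, so the two views cannot drift apart.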

The ERCM Program Office and individual service BCLs work with Microsoft Enterprise Governance Risk & Compliance (EGRC) to highlight any enterprise-level risks identified during annual plan testing and review. Each risk highlighted in this manner is assigned a risk rating and an owner to drive remediation, and is tracked until resolved. ERCM coordination takes a One Microsoft approach and results in a tightly knit relationship among partner teams. The following list outlines the teams involved in ERCM activities related to Microsoft 365:

  • Enterprise Governance Risk & Compliance (EGRC) – Team responsible for enterprise-level reporting on risk and compliance, and for identifying current Information Security Standards that align with the Microsoft Security Policy, implementation procedures, and recognized industry standards. Manages overall risk for Microsoft, including risks associated with ERCM.
  • ERCM Program Office – Team responsible for managing the Microsoft ERCM program, including resilience standards, policy, training, and metrics.
  • Business Continuity Council – Senior representatives from each business/engineering function, such as Microsoft 365, who collaborate on cross-group plans and overall policy.
  • Business Continuity Leads – Individuals who lead continuity and resilience efforts within their business unit (for example, Azure, Microsoft 365, Dynamics).
  • Business Continuity Champions – Individuals from each service team who lead Business Continuity and Disaster Recovery (BCDR) efforts for that team (for example, Azure Blob Storage, Exchange Online, Microsoft Teams, Power BI).
  • Workload DevOps – Engineers within service teams who are responsible for feature development, day-to-day operations, and support for live site issues, including BCDR responsibilities (for example, incident managers, on-call engineers, DevOps teams).
  • Microsoft 365 Incident Communication and Coordination – Microsoft 365 team that functions as a central hub for internal and external communication during an incident affecting Microsoft 365 services. Responsible for notifying customers of service-impacting incidents via the Microsoft 365 Service Health Dashboard and other communications platforms.
  • Customer Service and Support – Team responsible for handling customer-reported issues. Serves as a first point of contact for customers in the event of a disaster.

BCM framework

In addition to facilitating cooperation on business continuity, Microsoft's ERCM program provides a consistent BCM framework that is implemented by business units across the enterprise. This framework addresses the recovery and continuity of critical business functions, services, and data required to maintain an acceptable level of operations during an incident. Use of a common framework ensures the existence of effective, reliable, well-tested plans, systems, and processes that can be counted on to support business continuity and minimize adverse impacts during a disruptive event.

Diagram that shows how the ERCM program works with Microsoft business units. The ERCM program is responsible for governance, compliance, and guidance. Microsoft business units are responsible for following ERCM methodology and policy, and they collaborate with the ERCM program in several areas.

BCM lifecycle

The BCM lifecycle is at the core of our BCM methodology. This process is designed to be adaptable so it can be implemented by a wide variety of business models across Microsoft. The phases of the BCM lifecycle guide each business unit at Microsoft through developing and implementing effective business continuity and resilience plans.

The BCM lifecycle includes three high-level phases. It begins with an initial Assessment, which involves identifying critical processes and objectives that should be included in the business continuity program. The Planning phase focuses on developing and implementing resilience and recovery strategies, as well as documenting them in official business continuity plans. Finally, Capability Validation tests business continuity plans and their implementations to verify effectiveness and identify improvements.
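The three phases can be sketched as a simple ordered cycle. This is only an illustrative model: the `BcmPhase` and `next_phase` names are invented for this example, and the wrap-around from Capability Validation back to Assessment is an assumption drawn from the word "lifecycle", not something the text states explicitly.

```python
from enum import Enum


class BcmPhase(Enum):
    """The three high-level BCM lifecycle phases, in order."""
    ASSESSMENT = "Assessment"
    PLANNING = "Planning"
    CAPABILITY_VALIDATION = "Capability Validation"


def next_phase(phase: BcmPhase) -> BcmPhase:
    """Return the phase that follows `phase`, wrapping around at the end.

    The wrap-around is an assumption: improvements found during
    validation are taken to feed the next assessment.
    """
    phases = list(BcmPhase)  # Enum members iterate in definition order
    return phases[(phases.index(phase) + 1) % len(phases)]


phase = BcmPhase.ASSESSMENT
for _ in range(3):
    print(phase.value)
    phase = next_phase(phase)
```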

A diagram of the BCM lifecycle: assessment, planning, and capability validation.