Chaos engineering is a methodology that helps developers attain consistent reliability by hardening services against failures in production. A common way to introduce chaos is to deliberately inject faults that cause system components to fail. The goal is to observe, monitor, respond to, and improve your system's reliability under adverse circumstances.
It's difficult to simulate the characteristics of a service's behavior at scale outside a production environment. The transient nature of cloud platforms can exacerbate this difficulty. Architecting your service to expect failure is a core approach to creating a modern service. Chaos engineering embraces the uncertainty of the production environment and strives to anticipate rare, unpredictable, and disruptive outcomes, so that you can minimize any potential impact on your customers.
Chaos engineering is aimed at increasing your service’s resiliency and its ability to react to failures. By conducting experiments in a controlled environment, you can identify issues that are likely to arise during development and deployment. During this process, be vigilant in adopting the following guidelines:
- Be proactive.
- Embrace failure.
- Break the system.
- Identify and address single points of failure early.
- Install guardrails and graceful mitigations.
- Minimize the blast radius.
- Build immunity.
Chaos engineering should be an integral part of development team culture and an ongoing practice, not a short-term tactical effort in response to a single outage.
Development team members are partners in the process. They must be equipped with the resources to triage issues, implement the testability that's required for fault injection, and drive the necessary product changes.
When to apply chaos
Ideally, you should apply chaos principles continuously. There's constant change in the environments in which software and hardware run, so monitoring the changes is key. By constantly applying stress or faults on components, you can help expose issues early, before small problems are compounded by a number of other factors.
Apply chaos engineering principles when you're:
- Deploying new code.
- Adding dependencies.
- Observing changes in usage patterns.
- Mitigating problems.
Chaos engineering requires specialized expertise, technology, and practices. As with security and performance teams, the model of a central team supporting the service teams is a common, effective approach.
If you plan to practice the simulated handling of potentially catastrophic scenarios under controlled conditions, organize the exercise around the following goals:
- Familiarize team members with monitoring tools.
- Recognize outage patterns.
- Learn how to assess the impact.
- Determine the root cause and mitigate accordingly.
- Practice log analysis.
For each experiment, follow this overall method:
- Start with a hypothesis.
- Measure baseline behavior.
- Inject a fault or faults.
- Monitor the resulting behavior.
- Document the process and observations.
- Identify and act on the result.
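The experiment steps above can be sketched in code. The following Python example is a minimal illustration, not a prescribed implementation. The names `run_experiment` and `ExperimentReport` are hypothetical, and the latencies come from a seeded random generator rather than from a real service.

```python
import random
import statistics
from dataclasses import dataclass, field


@dataclass
class ExperimentReport:
    """Hypothetical record of one chaos experiment."""
    hypothesis: str
    baseline_ms: float
    faulted_ms: float
    hypothesis_held: bool
    notes: list = field(default_factory=list)


def measure(call, samples=50):
    """Return the median latency (ms) reported by `call` over several samples."""
    return statistics.median(call() for _ in range(samples))


def run_experiment(hypothesis, healthy_call, faulted_call, max_slowdown):
    # Measure baseline behavior before injecting anything.
    baseline = measure(healthy_call)
    # Inject the fault and monitor the resulting behavior.
    faulted = measure(faulted_call)
    held = faulted <= baseline * max_slowdown
    # Document the process and observations.
    report = ExperimentReport(hypothesis, baseline, faulted, held)
    report.notes.append(f"baseline={baseline:.1f}ms faulted={faulted:.1f}ms")
    # Identify and act on the result.
    if not held:
        report.notes.append("Hypothesis rejected: open a resiliency work item.")
    return report


# Simulated measurements (assumption: stand-ins for real telemetry).
rng = random.Random(7)


def healthy():
    return rng.uniform(40, 60)


def with_cache_outage():
    return rng.uniform(90, 110)


report = run_experiment(
    "Median latency stays within 3x of baseline when the cache is down",
    healthy,
    with_cache_outage,
    max_slowdown=3.0,
)
print(report.hypothesis_held, report.notes)
```

In a real experiment, the two callables would be replaced by measurements taken from your monitoring system before and during fault injection.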
Periodically validate your process, architecture choices, and code. By conducting fault-injection experiments, you can confirm that monitoring is in place and alerts are set up, the directly responsible individual (DRI) process is effective, and your documentation and investigation processes are up to date. Keep in mind a few key considerations:
- Challenge system assumptions.
- Validate change (topology, platform, resources).
- Use service-level agreement (SLA) buffers.
- Use live-site outages as opportunities.
Shift-left testing means experiment early, experiment often. Incorporate fault-injection configurations and create resiliency-validation gates during the development stages and in the deployment pipeline.
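As one possible shape for a resiliency-validation gate, the following Python sketch injects a fault into a dependency and checks that the service degrades gracefully. `RecommendationService`, `failing_dependency`, and `resiliency_gate` are hypothetical names chosen for illustration. A real gate would run inside your own test framework and pipeline.

```python
class RecommendationService:
    """Hypothetical service that depends on a remote recommendation call."""

    def __init__(self, fetch_recommendations):
        self._fetch = fetch_recommendations

    def homepage_items(self, user_id):
        try:
            return self._fetch(user_id)
        except (TimeoutError, ConnectionError):
            # Graceful mitigation: fall back to a static default list.
            return ["bestsellers"]


def failing_dependency(user_id):
    # Injected fault: the dependency times out on every call.
    raise TimeoutError("injected fault")


def resiliency_gate():
    """Return True if the service survives the injected dependency failure."""
    service = RecommendationService(failing_dependency)
    return service.homepage_items("user-1") == ["bestsellers"]


if __name__ == "__main__":
    # A pipeline step could fail the build when the gate returns False.
    raise SystemExit(0 if resiliency_gate() else 1)
```

Because the gate runs without any real infrastructure, it can execute on every commit. That frequency is the point of experimenting early and often.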
Shift-right testing means verifying that the service is resilient where it counts: in a pre-production or production environment with actual customer load. Adopt a proactive approach instead of reacting to failures, and take part in determining and controlling the requirements for the blast radius.
Stop the experiment when it goes beyond scope. Unknown results are an expected outcome of chaos experiments. Strive to achieve balance between collecting substantial result data and affecting as few production users as possible. For an example of this principle in practice, see the Bulkhead pattern article.
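The following Python sketch shows one way to limit the blast radius and decide when to stop. It's an illustration under stated assumptions: the function names, the percentage-based user bucketing, and the error-rate threshold are all hypothetical choices, not a prescribed mechanism.

```python
import zlib


def in_blast_radius(user_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent` of users in the experiment."""
    # crc32 gives a stable hash, so a user stays in the same bucket across calls.
    bucket = zlib.crc32(user_id.encode("utf-8")) % 10_000
    return bucket < percent * 100


def should_abort(errors: int, requests: int, max_error_rate: float,
                 min_requests: int = 100) -> bool:
    """Signal that the experiment has gone beyond its agreed scope."""
    if requests < min_requests:
        # Too little data to judge; keep collecting results.
        return False
    return errors / requests > max_error_rate


# Example: expose at most 1% of users, and stop above a 2% error rate.
exposed = [u for u in ("alice", "bob", "carol") if in_blast_radius(u, 1.0)]
print(exposed, should_abort(errors=5, requests=200, max_error_rate=0.02))
```

The `min_requests` floor reflects the balance described above: the experiment needs enough data to be meaningful, but the abort check keeps the number of affected users small.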
Error budget testing
Establish an error budget as an investment in chaos and fault injection. Your error budget is the difference between achieving 100% of the service-level objective (SLO) and achieving the agreed-upon SLO.
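As a worked example, the following Python sketch converts an SLO into an error budget. It assumes that the SLO is an availability percentage measured over a fixed time window. The function names and the 30-day window are illustrative assumptions.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability: the gap between 100% and the SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def remaining_budget_minutes(slo_percent: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left for chaos experiments after downtime already incurred."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


budget = error_budget_minutes(99.9)
print(f"{budget:.1f} minutes")  # 43.2 minutes
```

With a 99.9% SLO over 30 days, the budget is about 43.2 minutes. If incidents have already consumed 30 minutes, roughly 13 minutes remain for fault-injection experiments in that window.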
The following sections discuss additional considerations about chaos engineering, based on its application inside Azure.
Identify faults that are relevant to the development team
Work closely with the development teams to ensure the relevance of the injected failures. Use past incidents or issues as a guide. Examine dependencies and evaluate the results when those dependencies are removed.
An external team can't hypothesize faults for your team. A study of failures from an artificial source might be relevant to your team's purposes, but the effort must be justified.
Inject faults in a way that accurately reflects production failures
Simulate production failures. Treat injected faults in the same way that you would treat production-level faults. Enforcing a tighter limit on the blast radius will enable you to simulate a production environment. Each fault-injection effort must be accompanied by tooling that's designed to inject the types of faults that are relevant to your team's scenarios. Here are two basic ways to inject faults:
- Inject faults in a non-production environment, such as Canary or Test In Production (TIP).
- Partition the production service or environment.
If the impact of an experiment appears severe, halt all faults and roll back the state to its last-known good configuration.
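One way to make the rollback automatic is to scope each fault so that the last-known good configuration is restored no matter how the experiment ends. The following Python sketch assumes that the fault is expressed as configuration overrides on a flat dictionary. `injected_fault` and the configuration keys are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def injected_fault(config: dict, overrides: dict):
    """Apply fault overrides, then always restore the last-known good state."""
    # Snapshot the last-known good configuration (shallow copy; assumes flat values).
    last_known_good = dict(config)
    config.update(overrides)
    try:
        yield config
    finally:
        # Runs on normal exit and on exceptions, so the fault never outlives the block.
        config.clear()
        config.update(last_known_good)


config = {"timeout_ms": 500}
with injected_fault(config, {"timeout_ms": 1, "drop_rate": 0.5}):
    pass  # Run the experiment and monitor here.
print(config)  # {'timeout_ms': 500}
```

Tying the rollback to the `finally` clause means an aborted or crashed experiment still restores the original configuration.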
Build confidence incrementally
Start by hardening the core, and then expand out in layers. At each point, lock in progress with automated regression tests. Each team should have a long-term strategy based on a progression that makes sense for the team's circumstances.
By applying the shift-left strategy, you can help ensure that any obstacles to developer usage are removed early and the testing results are actionable.
The process must be very low tax. That is, the process must make it easy for developers to understand what happened and to fix the issues. The effort must fit easily into their normal workflow, not burden them with one-off special activities.
The following list shows faults that you can apply to inject chaos. The list represents commonly injected faults and isn't intended to be exhaustive.
- Read / Write latency
- Application layer (HTTP)
- Alter response codes (HTTP)
- Packet reorder / loss (TCP/UDP)
- Number of connections (active / passive)
- Domain / IP / Subnet
- Port / Protocol
- DNS host name resolution
- Stop / Kill
- Move primary node
- Intercept / Denylist calls
- Intercept / Re-route calls
- No access to service code
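Two of the faults listed above, added latency and altered HTTP response codes, can be illustrated with a small wrapper around a request handler. The following Python sketch is a simplified, in-process illustration. `with_faults`, its parameters, and the `(status, body)` handler shape are assumptions made for the example, not the API of any particular fault-injection tool.

```python
import random
import time


def with_faults(handler, latency_s=0.0, error_rate=0.0, error_status=503,
                rng=None, sleep=time.sleep):
    """Wrap `handler` so that calls are delayed and sometimes fail."""
    rng = rng or random.Random()

    def wrapped(request):
        if latency_s > 0:
            # Latency fault: delay every call by a fixed amount.
            sleep(latency_s)
        if rng.random() < error_rate:
            # Response-code fault: short-circuit with an error status.
            return error_status, "injected fault"
        return handler(request)

    return wrapped


def handler(request):
    # Hypothetical handler that returns a (status, body) pair.
    return 200, f"hello {request}"


# Fail about 30% of calls, using a seeded generator for repeatable runs.
flaky = with_faults(handler, error_rate=0.3, rng=random.Random(1))
statuses = [flaky("world")[0] for _ in range(10)]
print(statuses)
```

Passing `rng` and `sleep` as parameters keeps the wrapper testable: a seeded generator makes runs repeatable, and a stub `sleep` avoids real delays in automated tests.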