Shift right to test in production

One of the most effective ways DevOps teams can improve velocity is by shifting quality left: pushing aspects of testing earlier in the pipeline to reduce the time it takes for new code to reach production and operate reliably.

There's no place like production

While there are many kinds of tests that can easily be shifted left, such as unit tests, there is a whole class of tests that simply cannot run until part or all of a solution is deployed. Deploying to a QA or staging environment can approximate production, but there really is no substitute for the real thing.

The full breadth and diversity of the production environment is hard to replicate in a lab. The real workload of customer traffic is also hard to simulate. And even if tests are built and optimized, it becomes a significant responsibility to maintain those profiles and behaviors as the production demand evolves over time.

Moreover, the production environment keeps changing. Even if the app itself doesn't change, the infrastructure and services it relies on are constantly changing underneath it. Over time, teams find that certain types of testing just need to happen in production.

What is testing in production?

Testing in production is the practice of using real deployments to validate and measure an application's behavior and performance in the production environment. It serves two important purposes:

  • It validates the quality of a given production deployment.
  • It validates the health and quality of the constantly changing production environment.

Validating a deployment

To safeguard the production environment, it's necessary to roll out changes in a progressive and controlled manner. This is typically done via the ring model of deployments and with feature flags.
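As a rough illustration, here is a minimal Python sketch of how a feature flag might gate a new code path by ring. The ring assignments, flag names, and in-memory store are hypothetical; a real rollout would use a proper feature-flag service.

```python
# Minimal sketch (illustrative only): a feature flag that gates a new code path by ring.
# Ring assignments, flag names, and the in-memory store are hypothetical.

RING_OF_USER = {
    "canary-tester@contoso.com": 0,   # ring 0: internal validation
    "early-adopter@contoso.com": 1,   # ring 1: small slice of real users
}

class FeatureFlags:
    """In-memory stand-in for a feature-flag service."""

    def __init__(self, enabled_through_ring):
        # flag name -> highest ring the flag is enabled for
        self.enabled_through_ring = enabled_through_ring

    def is_enabled(self, flag, user):
        ring = RING_OF_USER.get(user, 3)  # unknown users land in the broadest ring
        return ring <= self.enabled_through_ring.get(flag, -1)

flags = FeatureFlags({"new-checkout-flow": 1})  # rolled out through ring 1 only

def checkout(user):
    if flags.is_enabled("new-checkout-flow", user):
        return "new checkout path"
    return "existing checkout path"

print(checkout("early-adopter@contoso.com"))  # new checkout path
print(checkout("someone-else@contoso.com"))   # existing checkout path
```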

The first ring should be the smallest size necessary to run the standard integration suite. These tests may be similar to those already run earlier in the pipeline against other environments, but it's important to run them again here to validate that the behavior in the production environment isn't different from the others. This is where obvious errors, such as misconfigurations, will be discovered before any customers are impacted.
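As a sketch, the same kind of smoke check that runs earlier in the pipeline can simply be pointed at the ring-0 deployment. The URL and expected payload below are hypothetical.

```python
# Minimal sketch: re-running a health/smoke check against the ring-0 deployment.
# The URL and expected payload are hypothetical.
import json
import urllib.request

RING0_BASE_URL = "https://ring0.example.com"

def test_ring0_health():
    # Same kind of check the pre-production stages run, pointed at production ring 0.
    with urllib.request.urlopen(f"{RING0_BASE_URL}/health", timeout=5) as resp:
        assert resp.status == 200
        body = json.load(resp)
    assert body.get("status") == "healthy"
```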

Once the initial ring is validated, the next ring can broaden to include a subset of real users, whose use of the new production services becomes the test run. When balanced properly, the value of detecting failures early exceeds the cost of those failures, which needs to be measured in a way that is meaningful to the business. For example, a bug that prevents a shopper from completing a purchase is serious, so it's far better to catch it while less than 1% of customers are on that ring than after all customers have been switched over at once.

If everything looks good so far, the deployment can progress through further rings and tests until it's used by everyone. However, full deployment doesn't mean that testing is over; tracking telemetry is critically important for testing in production. It's arguably the highest quality test data because it reflects the real customer workload. It tracks failures, exceptions, performance metrics, security events, and more, and it helps detect anomalies.
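One way to picture this is a thin tracking layer that records durations and failures for each operation and flags an anomaly when the failure rate crosses a threshold. The sketch below is illustrative; a real system would emit these numbers to its telemetry pipeline (Application Insights, Prometheus, or similar) rather than keeping in-process counters.

```python
# Minimal sketch: recording failures and latency from the real workload and flagging
# an anomaly when the failure rate crosses a threshold. Metric names and thresholds
# are illustrative.
import time
from collections import defaultdict

metrics = defaultdict(list)

def tracked(operation):
    """Decorator that records duration and failures for every call to an operation."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            except Exception:
                metrics[f"{operation}.failures"].append(1)
                raise
            finally:
                metrics[f"{operation}.duration_ms"].append((time.monotonic() - start) * 1000)
        return inner
    return wrap

def anomaly_detected(operation, max_failure_rate=0.01):
    calls = len(metrics[f"{operation}.duration_ms"])
    failures = len(metrics[f"{operation}.failures"])
    return calls > 0 and failures / calls > max_failure_rate
```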

Fault injection and chaos engineering

Teams often employ fault injection and chaos engineering to see how a system behaves under failure conditions. This helps to validate that the resiliency mechanisms implemented actually work. It also helps to validate that a failure starting in one subsystem is contained within that subsystem and doesn't cascade to produce a major outage for the entire product. Without fault injection, it is difficult to prove that repair work implemented from a prior incident would have the intended effect until another incident occurs. Fault injection also helps create more realistic training drills for live site engineers so that they can be better prepared to deal with real incidents.

Fault testing with a circuit breaker

A circuit breaker is a mechanism that cuts off a given component from a larger system. It's usually employed to avoid having failures in that component spread outside its boundaries. However, circuit breakers can be intentionally triggered in order to test how the system responds.
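A minimal sketch of such a breaker, with a manual trip switch so a test or config change can force it open, might look like the following. The thresholds, timeouts, and half-open behavior are illustrative, not a prescription.

```python
# Minimal circuit breaker sketch with a manual trip switch for testing.
# Thresholds and timeouts are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None                    # None means the breaker is closed

    def force_open(self):
        """Trip the breaker deliberately, for example from a test or a config change."""
        self.opened_at = time.monotonic()

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                return fallback()                # open: shed load to the fallback
            # Past the reset timeout: half-open, allow one trial call through.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.force_open()                # open (or re-open after a failed trial)
            return fallback()
        self.failures = 0
        self.opened_at = None                    # a successful call closes the breaker
        return result
```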

Circuit breakers can be intentionally triggered to evaluate two important scenarios:

  1. When the circuit breaker opens, does the fallback work? It may work with unit tests, but there's no way to know for sure that it will behave as expected in production without injecting a fault to trigger it.

  2. Does the circuit breaker open when it needs to? Does it have the right sensitivity threshold configured? Fault injection may force latency and/or disconnect dependencies in order to observe breaker responsiveness. In addition to evaluating that the right behavior is occurring, it's important to determine whether it happens quickly enough.

Example: Testing a circuit breaker around Redis Cache

Consider a service that takes a non-critical dependency on Redis. Redis is a distributed cache, used here purely to improve performance by speeding up access to commonly used data. If it goes down, the system should continue to work as designed because it can fall back to the original data source for all requests.
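A sketch of that fallback path, reusing the CircuitBreaker sketch above, might look like this. The redis_get and sql_get helpers are hypothetical stand-ins for the real cache client and the authoritative data source.

```python
# Minimal sketch: a read path with a non-critical Redis dependency, reusing the
# CircuitBreaker sketch above. redis_get and sql_get are hypothetical stand-ins.
breaker = CircuitBreaker(failure_threshold=5, reset_timeout_s=30.0)

def redis_get(key):
    """Stand-in for a Redis GET; the real client call goes here."""
    return {"key": key, "source": "redis"}

def sql_get(key):
    """Stand-in for the original data source; always available."""
    return {"key": key, "source": "sql"}

def get_product(product_id):
    key = f"product:{product_id}"
    # If Redis fails or the breaker is open, serve the request from SQL instead.
    return breaker.call(
        primary=lambda: redis_get(key),
        fallback=lambda: sql_get(key),
    )
```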

To confirm that a Redis failure will trigger the circuit breaker in production, the team should occasionally test that hypothesis by running tests against it.

Redis testing in production

In this setup there are three ATs, with the breaker sitting in front of the call to Redis. The goal is to make sure that when the breaker opens, calls ultimately go to SQL. One test forces the circuit breaker open through a config change and observes whether the calls go to SQL. A second test makes the opposite config change, closing the circuit breaker, and confirms that calls return to Redis.
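Expressed against the sketches above, those two checks might look like this. Flipping the breaker's state directly stands in for the config change, and the assertions on "source" are purely illustrative.

```python
# Minimal sketch of the two config-driven checks, reusing the breaker and get_product
# sketches above. Flipping the breaker's state stands in for the config change.
def test_forced_open_falls_back_to_sql():
    breaker.force_open()                        # "config change": open the breaker
    assert get_product(42)["source"] == "sql"   # calls should now go to SQL

def test_closed_breaker_uses_redis_again():
    breaker.opened_at = None                    # "config change": close the breaker
    breaker.failures = 0
    assert get_product(42)["source"] == "redis"
```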

This test validates that the fallback behavior works when the breaker opens, but it doesn't validate the circuit breaker's configuration. Will the breaker open when it needs to? Answering that question requires simulating actual failures.

Redis testing with fault injection

This is where fault injection comes into play. A fault agent can introduce faults into calls going to Redis. In this case, the fault injector blocks Redis requests, so the circuit breaker opens. The test can then observe that fallback works as before. When the fault is removed, the circuit breaker sends a trial request to Redis; if it succeeds, calls revert back to Redis. The next step is to test the sensitivity of the breaker: whether the threshold is too high or too low, whether other system timeouts interfere with the circuit breaker behavior, and so on.
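Continuing the sketches above, a hypothetical fault injector can simply make the Redis call fail until the breaker opens. The in-process flag below stands in for a real fault agent or proxy that would block the traffic instead.

```python
# Minimal sketch: injecting a fault into the Redis call until the breaker opens,
# reusing the sketches above. The in-process flag stands in for a real fault agent.
redis_fault_active = False

def redis_get_with_faults(key):
    if redis_fault_active:
        raise ConnectionError("injected fault: Redis unreachable")
    return {"key": key, "source": "redis"}

def run_fault_experiment():
    global redis_fault_active
    breaker.failures, breaker.opened_at = 0, None   # start from a healthy, closed breaker

    redis_fault_active = True
    # Drive enough failing traffic to cross the failure threshold and open the breaker.
    for _ in range(breaker.failure_threshold):
        result = breaker.call(lambda: redis_get_with_faults("product:42"),
                              lambda: sql_get("product:42"))
    assert result["source"] == "sql"      # fallback served the traffic
    assert breaker.opened_at is not None  # and the breaker opened

    # Remove the fault. After reset_timeout_s the breaker half-opens, sends a trial
    # call to Redis, and closes again if that call succeeds.
    redis_fault_active = False
```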

In this example, if the breaker did not open or close as expected, it could result in a live site incident. Without fault injection testing, the circuit breaker would remain unproven, because this kind of failure is hard to reproduce in a lab environment.

Fault injection tips

Chaos engineering can be an effective tool, but it should be limited to canary environments, that is, environments that have little or no customer impact.

It's a good practice to automate fault injection experiments because they are expensive tests and the system is always changing.

Business Continuity and Disaster Recovery (BCDR)

The other form of fault testing is failover testing. Teams should have failover plans for all services as well as subsystems. The plan should cover several topics:

  1. A clear explanation of the business impact of the service going down.
  2. A map of all the dependencies in terms of platform, technology, and the people devising the BCDR plans.
  3. Formal documentation of the disaster recovery procedures.
  4. A cadence to regularly execute the DR drills.

Microservice compatibility

As projects grow, they may find themselves with a large number of microservices that are deployed and managed independently. Shifting right is especially important here, because many different combinations of versions and configurations will find their way to production. Regardless of how many pre-production test layers are in place, testing compatibility in production becomes a necessity.
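One lightweight form of this is a compatibility probe that a service runs in production against the dependencies it calls. The endpoint, version numbers, and service names below are hypothetical.

```python
# Minimal sketch: a compatibility probe one service runs against a dependency in
# production. The endpoint, version numbers, and service names are hypothetical.
import json
import urllib.request

SUPPORTED_ORDER_API_VERSIONS = {"2.3", "2.4", "3.0"}

def check_order_service_compatibility(base_url="https://orders.example.com"):
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as resp:
        version = json.load(resp)["version"]
    if version not in SUPPORTED_ORDER_API_VERSIONS:
        raise RuntimeError(f"order API version {version} is not supported by this service")
```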

Next steps

Releasing to production is just half the job. The other half is ensuring quality at scale with a real workload. There really is no place like production for that. The environment keeps changing, so the team is never done with testing in production. This includes monitoring, fault injection, failover testing, and all other forms.

Learn more about monitoring.