Reliability metrics


When you read material that references availability and reliability, you'll sometimes see the term nines. Five nines or nine nines refers to the number of nines in an availability percentage: two nines is 99%, three nines is 99.9%, four nines is 99.99%, and so on.
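
To make those percentages concrete, here's a minimal sketch that converts a count of nines into an availability figure and the downtime it permits per year. The helper name and the use of a 365.25-day year are illustrative assumptions, not something specified above.

```python
# A minimal sketch (illustrative, not from the text): convert a count of nines
# into an availability percentage and the maximum downtime allowed per year.

def downtime_per_year(nines: int) -> float:
    """Return the allowed downtime in minutes per year for N nines of availability."""
    availability = 1 - 10 ** (-nines)        # e.g., 3 nines -> 0.999
    minutes_per_year = 365.25 * 24 * 60      # assumes a 365.25-day year
    return (1 - availability) * minutes_per_year

for n in range(2, 6):
    print(f"{n} nines = {1 - 10 ** (-n):.4%} availability, "
          f"~{downtime_per_year(n):.1f} minutes of downtime per year")
```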

Mean time between failures

You'll also see the phrases mean time between failures (MTBF) and mean time to failure (MTTF) in the specifications for many individual components (for example, hard drives, motherboards, and power supplies). These figures are defined as the average number of hours that a component is expected to last, and they're usually determined by the manufacturer by testing a sample of parts under accelerated, more extreme conditions. However, failure rates reported in the field are often higher. For example, hard drives are commonly rated at 1 million hours or more, yet their observed failure rates have been found to be 2 to 10 times higher than the rating implies.¹ Google found drive failure rates to be 50% higher on average in its own study.² The failure rate is 1 / MTBF. For example, if the MTBF of a device is 100 hours, then the chance of that device failing within any 1-hour period is 1/100, or 0.01, which is 1%.
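
The short sketch below just restates that arithmetic in code; the 100-hour MTBF and the variable names are illustrative assumptions.

```python
# A minimal sketch of the relationship described above: failure rate = 1 / MTBF,
# so a device with a 100-hour MTBF has roughly a 1% chance of failing in any
# given hour. The values and names here are illustrative.

mtbf_hours = 100.0                  # MTBF rating for the device
failure_rate = 1.0 / mtbf_hours     # failures per hour

print(f"Failure rate: {failure_rate:.2%} per hour")   # -> 1.00% per hour
```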

It's important to note that when determining the overall MTBF of a system built from non-redundant components, the reciprocals of the individual component MTBFs are summed, and the system MTBF is the reciprocal of that sum. Formally:

$$ \frac{1}{MTBF_{system}} = \left(\frac{1}{MTBF_{c1}} + \frac{1}{MTBF_{c2}} + \cdots + \frac{1}{MTBF_{cn}} \right) $$
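
As an illustration of the series formula above, the following sketch computes a system MTBF from a few component MTBFs. The component values are made up for the example.

```python
# A minimal sketch of the series (non-redundant) formula above: sum the
# reciprocals of the component MTBFs, then take the reciprocal of that sum.
# The component MTBF values are made up for illustration.

component_mtbfs = [100_000.0, 250_000.0, 500_000.0]   # hours

system_failure_rate = sum(1.0 / m for m in component_mtbfs)   # failures per hour
system_mtbf = 1.0 / system_failure_rate

print(f"System MTBF: {system_mtbf:,.0f} hours")   # lower than any single component
```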

On the other hand, when a system consists of redundant components, all of the redundant components must fail simultaneously for the system as a whole to fail. The overall MTBF of the system is thus the product of the MTBFs of the individual redundant components. Formally:

$$ MTBF_{system} = MTBF_{rc1} \times MTBF_{rc2} \times \cdots \times MTBF_{rcn} $$
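
The sketch below applies the product rule stated above to a hypothetical two-drive mirror; the MTBF values are illustrative assumptions.

```python
# A minimal sketch that applies the product rule stated above for redundant
# components (all of them must fail for the system to fail). The MTBF values
# for the mirrored pair are made up for illustration.

from math import prod

redundant_mtbfs = [500_000.0, 500_000.0]   # hours, e.g., a two-drive mirror

system_mtbf = prod(redundant_mtbfs)        # product of the redundant MTBFs
print(f"Redundant-pair MTBF: {system_mtbf:,.0f} hours")
```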

One factor that's often overlooked when considering uptime is human error. No matter how much redundancy is designed into a system, and even if that system is properly implemented and maintained, there is always some likelihood that a person will make a mistake, which can eventually leave a service unavailable (downtime). Some of these mistakes can be prevented through policy, standard configurations, good documentation, and change management.

When it comes to large cloud deployments, there is little focus on the hardware resiliency of an individual server. When 10,000 or more servers work together as part of a single application, fault tolerance is built into the application itself, so the failure of a single server, or even several, does not disrupt the application or service. Small and medium-sized businesses, and even large enterprises with legacy applications, cannot afford to build these fully customized, cloud-style applications, so they rely on third-party software, most of which does not respond well to hardware failures. Cloud providers, by contrast, focus on server hardware that is inexpensive and as energy efficient as possible, removing unneeded parts.


References

  1. Schroeder, Bianca, and Gibson, Garth A. (2007). Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? In Proceedings of the 5th USENIX Conference on File and Storage Technologies.
  2. Pinheiro, Eduardo, Weber, Wolf-Dietrich, and Barroso, Luiz André. (2007). Failure Trends in a Large Disk Drive Population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies.

Check your knowledge

1. Assume you have 20,000 independent hard drives of a particular model in your datacenter, each with a manufacturer-specified MTBF of 1 million hours. Assume you do not trust the manufacturer-specified MTBF, so divide it by two to get 500,000 hours. During the second year of those drives' life span, how many of the 20,000 would you expect to fail?

2. Consider the same scenario from the previous question. If each drive is part of a two-drive RAID 1 mirror, would you expect to lose any data from a double drive failure on any one of those 10,000 RAID 1 arrays during that year? (Also assume that a failed drive is replaced immediately, and that no additional drives fail during the rebuild.)