During Seagate’s recent launch of its Savvio 10K.4 drive, I read some news stories suggesting that the 10K.4’s two-million-hour Mean Time Between Failure (MTBF) specification meant the drive’s actual average lifespan was over 200 years of use. While the 10K.4’s reliability rating is indeed 25% higher than that of the prior generation (2,000,000 hours versus 1,600,000), the MTBF specification itself is often misunderstood. So let’s talk a little bit about what “The Mean” really means and why Annualized Failure Rate (AFR) is a preferred predictor of reliability. Many thanks go to Bill Rudock of Seagate for his insight and help with this week’s blog.
Before we get into the numbers, an important note concerning either MTBF or AFR is that these are used to estimate reliability for a population or group of drives. Neither specification is designed to determine a single individual drive’s useful lifespan.
Historically, the term Mean Time Between Failure (MTBF) was most often used as a reliability description for repairable systems. Very early in the history of hard drives – which now spans over 50 years – drives were actually repairable systems that required frequent field service intervention/maintenance. The mean times between failures were measured in days or weeks. The term has remained in use for describing disc drive reliability since those early days.
Let’s first give an example of MTBF as it applies to a population of repairable airplanes. Consider a commercial fleet of 100 airplanes tracked for one year (= 8,760 hrs). Assume seven airplanes (#9, #16, #18, #56, #61, #77 and #94) each experienced one failure during the year that required repair, perhaps of a single component. In addition, airplane #41 required two repairs and airplane #67 required six repairs in that same year. Neglecting any downtime while the repairs take place, and assuming the airplanes are otherwise always flying, the cumulative time the fleet is utilized is: 100 x 8,760 hrs = 876,000 hrs. The total number of failures for the fleet for the year was 7+2+6=15. The MTBF for the fleet = 876,000/15 = 58,400 hrs. Notice that the MTBF is longer than a year, and yet some individual airplanes experienced a failure within the year. Also notice that the MTBF as calculated here does not take into account when any of the individual airplanes failed. Nor does it seem to be an accurate description of airplane #67, which required six repairs in that year.
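The airplane arithmetic above can be reproduced in a few lines. This is only a sketch of the calculation as described; the names `repairs` and `fleet_mtbf` are made up for illustration. It also computes the figure for airplane #67 alone, to show how far one unit can sit from the fleet-wide mean.

```python
HOURS_PER_YEAR = 8_760
FLEET_SIZE = 100

# Repairs per airplane during the year, keyed by airplane number.
# Airplanes not listed had no failures.
repairs = {9: 1, 16: 1, 18: 1, 56: 1, 61: 1, 77: 1, 94: 1, 41: 2, 67: 6}


def fleet_mtbf(fleet_size, hours_each, failures):
    """Cumulative operating hours for the whole fleet divided by total failures."""
    return fleet_size * hours_each / failures


total_failures = sum(repairs.values())
mtbf_hours = fleet_mtbf(FLEET_SIZE, HOURS_PER_YEAR, total_failures)

# The same ratio for airplane #67 on its own: one airplane, six repairs.
plane_67_hours = fleet_mtbf(1, HOURS_PER_YEAR, repairs[67])

print(f"Total failures: {total_failures}")
print(f"Fleet MTBF: {mtbf_hours:,.0f} hrs")
print(f"Airplane #67 alone: {plane_67_hours:,.0f} hrs between failures")
```

The fleet figure comes out at 58,400 hrs, while airplane #67 by itself averages 1,460 hrs between failures, which is why the fleet MTBF says little about any one airplane.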
Disc drive reliability specifications based on MTBF can lead to common misconceptions. For example, Seagate’s previous generation of enterprise-class disc drives has a specified MTBF of 1,600,000 hrs. This is much longer than any individual drive’s expected mission life. Yet someone might innocently read that specification and expect every individual drive to last that long. Seagate has moved to adding an Annualized Failure Rate (AFR) specification to make its reliability descriptions clearer and more precise. For disc drives, the reliability metrics and specifications (AFR or MTBF) are, necessarily, probabilistic population metrics for groups of drives.
The following is quoted from a prior-generation Seagate Cheetah drive product manual for MTBF:
“… The mean time between failures (MTBF) target is specified as device power-on hours (POH) for all drives in service per failure. The following expression defines MTBF:
“Estimated power-on operating hours means power-on hours per disc drive times the total number of disc drives in service.”
Now, on to AFR. Compare the above with the following quotation from a current generation Product Manual:
“These drives shall achieve an AFR of 0.55% (MTBF of 1,600,000 hours) when operated in an environment that ensures the HDA case temperatures do not exceed the values specified in Section 6.4.1. Operation at case temperatures outside the specifications in Section 6.4.1 may increase the AFR (decrease the MTBF). AFR and MTBF statistics are population statistics that are not relevant to individual units.
AFR and MTBF specifications are based on the following assumptions for Enterprise Storage System environments:
• 8,760 power-on hours per year
• 250 average on/off cycles per year
• Operating at nominal voltages
• System provides adequate cooling to ensure the case temperatures specified in Section 6.4.1 are not exceeded”
To calculate AFR, we use this formula: AFR = 1 – exp ( – Annual Operating Hours / MTBF)
But even setting the formula aside for a moment, the AFR percentage (0.55% in the above example) is clearer and more easily understood than an MTBF figure in hours.
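As a quick check, the formula above can be evaluated for the two MTBF figures mentioned in this post, using the 8,760 power-on hours per year assumed in the product manual. This is a minimal sketch: the function name `afr_from_mtbf` and the 1,000-drive population are illustrative assumptions, not anything taken from Seagate documentation.

```python
import math


def afr_from_mtbf(mtbf_hours, annual_hours=8_760):
    """AFR = 1 - exp(-Annual Operating Hours / MTBF), returned as a fraction."""
    return 1.0 - math.exp(-annual_hours / mtbf_hours)


afr_prev = afr_from_mtbf(1_600_000)  # prior-generation MTBF specification
afr_new = afr_from_mtbf(2_000_000)   # Savvio 10K.4 MTBF specification

# Hypothetical population of 1,000 drives, purely for illustration.
POPULATION = 1_000
expected_failures = POPULATION * afr_prev

print(f"MTBF 1,600,000 hrs -> AFR {afr_prev:.2%}")
print(f"MTBF 2,000,000 hrs -> AFR {afr_new:.2%}")
print(f"Expected failures per year in {POPULATION:,} drives: {expected_failures:.1f}")
```

For 1,600,000 hours this gives roughly 0.55%, matching the product manual quoted above, and for 2,000,000 hours roughly 0.44%. Read as a population figure, 0.55% means about five or six failures a year in a group of 1,000 drives, not a prediction for any single drive.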
The MTBF estimated in the airplane example, and implied by the calculation method described in the previous-generation Cheetah product manual, inherently assumes a statistically constant failure rate. Though that assumption is commonly made, Seagate finds that it is not generally true for disc drives, so use of the term MTBF can again be confusing. As a result, we prefer Annualized Failure Rate (AFR) as a reliability metric, but we include MTBF in enterprise product literature for historical reference.