FAST ’07, the USENIX Conference on File and Storage Technologies, was held from February 13th through the 16th. A number of interesting papers were presented during the conference, two of which I want to highlight. I learned of these papers through posts on Slashdot rather than by actually attending. Honestly, I’m not a storage expert, but I find these studies interesting.
The first study, “Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?” was written by a Carnegie Mellon University professor, Garth Gibson, and a recent PhD graduate, Bianca Schroeder.
This study compared the manufacturers’ specifications for MTTF (mean time to failure) and AFR (annualized failure rate) against real-world hard drive replacement rates. The paper leans heavily on statistical analysis, which makes it a rough read for some. However, if you can wade through all of the statistics, there is some good information here.
Manufacturers generally list MTTF values of 1,000,000 to 1,500,000 hours. AFR is calculated by dividing the number of hours in a year (8,760) by the MTTF, which works out to an AFR of roughly 0.58% to 0.88%. In a nutshell, this means you have about a 0.6 to 0.9% chance of a given hard drive failing in any year.
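To make the arithmetic concrete, here is a quick Python sketch (mine, not from the paper) that converts a spec-sheet MTTF into the implied AFR:

```python
# Convert a manufacturer's MTTF spec into the implied annualized failure rate (AFR).
HOURS_PER_YEAR = 8760  # 365 days x 24 hours

def afr_from_mttf(mttf_hours: float) -> float:
    """Annualized failure rate implied by an MTTF, as a percentage."""
    return HOURS_PER_YEAR / mttf_hours * 100

for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} hours -> AFR {afr_from_mttf(mttf):.2f}%")
# MTTF 1,000,000 hours -> AFR 0.88%
# MTTF 1,500,000 hours -> AFR 0.58%
```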
As explained in the study, determining whether a hard drive has failed or not is problematic at best. Manufacturers report that up to 40% of drives returned as bad are found to have no defects.
The study concludes that real-world replacement rates are much higher than the published MTTF values would suggest. It also finds that failure rates are similar across different drive types, such as SCSI, SATA, and Fibre Channel. The authors go on to recommend some changes to the standards based on their findings.
The second study, “Failure Trends in a Large Disk Drive Population,” was presented by a group of Google researchers: Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. This paper is geared toward finding trends in drive failures. Essentially, the goal is to build a reliable model for predicting drive failures so that a drive can be replaced before essential data is lost.
The researchers used an extensive database of hard drive statistics gathered from the 100,000+ hard drives deployed throughout Google’s infrastructure. Statistics such as utilization, temperature, and a variety of SMART (Self-Monitoring, Analysis and Reporting Technology) signals were collected over a five-year period.
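Just to illustrate the kind of per-drive data involved, here is a minimal sketch of how a monitoring job might pull a few SMART attributes using smartmontools’ smartctl (this assumes smartctl is installed and the script can read /dev/sda; it is in no way Google’s actual collection pipeline):

```python
# Minimal sketch: read a few SMART attributes with `smartctl -A`.
# Assumes smartmontools is installed and we have permission to query the device.
import subprocess

ATTRS = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable", "Temperature_Celsius"}

def read_smart(device="/dev/sda"):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    values = {}
    for line in out.splitlines():
        fields = line.split()
        # ATA attribute rows look like:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in ATTRS:
            values[fields[1]] = fields[9]
    return values

if __name__ == "__main__":
    print(read_smart())
```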
This study is well written and easy to follow for non-academics and readers without statistical training. The data is laid out cleanly, and each parameter studied is clearly explained.
Traditionally, temperature and utilization have been pinpointed as the root causes of most failures. However, this study shows only a very small correlation between failure rates and these two parameters. In fact, failure rates for high-utilization drives seemed to be highest for drives under one year old, and otherwise stayed within 1% of the rates for low-utilization drives. It was only near the end of a drive’s expected lifetime that the failure rate for high utilization jumped up again. Temperature was even more of a surprise: low-temperature drives actually failed more often than high-temperature drives until about the third year of life.
The report essentially concludes that a reliable failure-prediction model is out of reach at this time, because no single parameter reliably indicates imminent failure. SMART signals were useful in flagging impending failures, and most drives fail within 60 days of their first reported errors. However, 36% of their failed drives reported no errors at all, making SMART a poor overall predictor.
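To put that 36% figure in perspective, here is a quick back-of-the-envelope calculation (my numbers, not the paper’s model): even a perfect rule that flagged every drive showing any SMART error would still miss more than a third of failures.

```python
# Upper bound on what a SMART-only predictor could catch, given that 36% of
# failed drives in the Google study reported no SMART errors at all.
# The population size here is purely illustrative.
failed_drives = 1000
no_smart_errors = 0.36
catchable = failed_drives * (1 - no_smart_errors)
print(f"At best {catchable:.0f} of {failed_drives} failures ({1 - no_smart_errors:.0%}) "
      "could have been flagged ahead of time.")
```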
Unfortunately, neither study names the manufacturers or models of the drives involved. This is likely due to professional courtesy, along with a lack of interest in being sued for defamation. While these studies will undoubtedly be useful to those designing large-scale storage networks, manufacturer-specific information would have been a great help.
Personally, I mostly rely on Seagate hard drives. I’ve had very good luck with them, with only a handful failing on me over the past few years. Maxtor used to be my second choice, but they were acquired by Seagate at the end of 2005. I tend to stay away from Western Digital drives, having had several bad experiences with them in the past. In fact, my brother had one of their drives literally catch fire and destroy his computer. IBM has also had some issues, especially with its Deskstar line, which many people nicknamed the “Death Star.”
With the amount of information stored on hard drives today, and the massive amounts to come, hard drive reliability is a concern for many vendors. It should be a concern for end users as well, although end users are not likely to take it seriously. Overall, these two reports are an excellent overview of the current state of drive reliability and the trends seen today. Hopefully drive manufacturers can use them to design changes that increase reliability and facilitate earlier detection of impending failures.