The Petabyte Challenge…
Managing large amounts of data is a very challenging technical endeavor, but for many cloud storage providers the volume of data they are tasked with protecting and maintaining is so large that it presents unique challenges.
One of those particularly daunting issues is “silent data corruption.” This topic is discussed by Zetta CTO Jeff Whitehead in a recent blog entry. Whitehead’s excellent description of the problem and how to analyze it includes a calculator to help estimate the probability of random disk failures – this should be required reading for any system administrator or architect of a cloud storage solution. If you’ve built one and this is news to you – you (and your customers) are in trouble…
We included a large excerpt below (Jeff: let us know if you would prefer we take it down):
IT professionals are well aware of many challenges related to scaling storage: capital required to house data, manage backups, data center space, power and cooling. One area many IT professionals haven’t had time to look at, however, is how increasing data footprints translate into increased risk of data loss or data corruption. To put this in context, IDC recently reported that data volumes will increase by a “factor of almost five,” while “total IT budgets worldwide will only grow by a factor of 1.2 and IT staff by a factor of 1.1.” In this context of constraints, being asked to do more with less, without special attention to data risk management, risk inevitably increases.
I believe that many IT professionals and CIOs will be very surprised to see that while Data Loss (i.e., simultaneous drive failures) may not be very probable, Data Corruption (the data on disk is no longer what was originally written out by the application) is shockingly likely, and has caused outages in even some of the most technologically advanced high-end environments.
The objective of this blog is to introduce or reintroduce the concept of “Mean Time To Data Loss (MTTDL),” whereby IT professionals, CIOs, and risk managers can create a probabilistic model for evaluating the reliability and probability of data loss for their current environment, and also compare and contrast it with how Zetta is advancing the state of the art for cost-effective data protection.
MTTDL is a tool, and to be effective one must understand its limitations. The inputs to the model are as follows:
The number of hard drives (data set size/system performance)
The reliability of each hard drive
The probability of reading a given hard drive correctly without error (see prior blog about silent data corruption)
The redundancy encoding of the system
The rebuild rate.
Mean Time to Data Loss is in many respects a best case scenario, because it ignores risks to data integrity such as fire, natural disaster, human error, and other common causes of storage failures. It also ignores autocorrelation, or drives failing at the same time due to similar workload, similar manufacturing batches, firmware issues, or the like. Despite these limitations, MTTDL is still one of the better tools for evaluating the data protection features of a storage system.
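To make the inputs listed above concrete, here is a minimal Python sketch of how they might be combined. It is not the calculator from Whitehead's post. It assumes the textbook approximation for a single-parity (RAID-5 style) group, in which data is lost if a second drive fails, or a read error occurs, while the first failed drive is being rebuilt. It also assumes independent failures, which is exactly the limitation noted above. The function names and every number in the example (drive count, MTTF, bit error rate, rebuild speed) are hypothetical placeholders, not figures from the excerpt.

```python
import math


def rebuild_hours(drive_bytes, rebuild_bytes_per_sec):
    """Hours needed to rebuild one drive at a given rebuild rate."""
    return drive_bytes / rebuild_bytes_per_sec / 3600.0


def p_read_error(bytes_read, bit_error_rate):
    """Probability of at least one unreadable bit while reading bytes_read.

    Computes 1 - (1 - ber)^bits in a numerically stable way.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-bit_error_rate))


def mttdl_single_parity_hours(n_drives, drive_mttf_hours, mttr_hours,
                              p_rebuild_read_error=0.0):
    """Approximate MTTDL (hours) for one single-parity group.

    Assumption: independent drive failures. With p_rebuild_read_error = 0
    this reduces to MTTF^2 / (N * (N - 1) * MTTR).
    """
    # Rate at which any one of the N drives fails first.
    first_failure_rate = n_drives / drive_mttf_hours
    # Chance that one of the remaining N - 1 drives fails during the rebuild.
    p_second_failure = (n_drives - 1) * mttr_hours / drive_mttf_hours
    # Chance the rebuild is lost, either to a second failure or a read error.
    p_loss = min(1.0, p_second_failure + p_rebuild_read_error)
    return 1.0 / (first_failure_rate * p_loss)


if __name__ == "__main__":
    # Hypothetical inputs: 8 drives of 2 TB, 1,000,000 h MTTF,
    # 1e-14 bit error rate, 50 MB/s rebuild rate.
    n, size, mttf, ber = 8, 2e12, 1_000_000.0, 1e-14
    mttr = rebuild_hours(size, 50e6)
    # A rebuild must read the N - 1 surviving drives in full.
    p_err = p_read_error((n - 1) * size, ber)
    hours_per_year = 24 * 365
    print("MTTDL, drive failures only (years):",
          mttdl_single_parity_hours(n, mttf, mttr) / hours_per_year)
    print("MTTDL, including read errors (years):",
          mttdl_single_parity_hours(n, mttf, mttr, p_err) / hours_per_year)
```

Comparing the two printed figures shows how strongly the read-error term can dominate the result, even when simultaneous drive failures on their own look improbable. Other redundancy encodings would need a different loss condition than the single-parity one assumed here.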