Odd failures are to be expected… what is your plan?
Sunday, July 27th, 2008
Last weekend the Amazon Web Services Simple Storage Service (S3) system was disabled for a number of hours. The outage affected a huge number of startups and established companies that rely upon S3 for the operation of their systems.
Systems fail… they always do. This isn’t a screed declaring that cloud services are unreliable or that Amazon ought to be excoriated for being fallible. Failures happen to all systems. Those who strive for faultless operation are often the most disappointed when the inevitable occurs. When customers are on their backs, they often lean quite heavily on their partners to help them find answers.
The public results of the root cause analysis have been posted. The short of it:
…we found that there were a handful of [system status] messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect.
The details are a bit terse, but one can surmise that the monitoring and failure analysis mechanisms used to manage the system were themselves what caused the failure. As they put it: “…when the corruption occurred, we didn’t detect it and it spread throughout the system causing the [failure]…”
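To make the single-bit case concrete, here is a minimal Python sketch of one common defense: attaching a checksum to each status message so that a flipped bit is rejected instead of being passed along. It is an illustration only and says nothing about how S3 is actually built. The message format and the names seal, unseal, and flip_bit are hypothetical, made up for this example.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (hypothetical message format)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unseal(message: bytes) -> bytes:
    """Return the payload if the trailing CRC-32 matches, else raise ValueError."""
    if len(message) < 4:
        raise ValueError("message too short to carry a checksum")
    payload, trailer = message[:-4], message[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        raise ValueError("checksum mismatch: message corrupted in transit")
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate corruption by inverting exactly one bit of the message."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# A made-up status message, sealed before it is passed between nodes.
status = seal(b"node-17 state=healthy")

# The same message with a single bit inverted along the way.
damaged = flip_bit(status, 3)
```

Calling unseal(status) returns the original payload, while unseal(damaged) raises ValueError, so the receiver can drop the message rather than act on bad state. A CRC-32 catches every single-bit error, but it only guards against accidental corruption, not deliberate tampering.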
This reminded me of the 1990 AT&T Network “crash”. The entry of one system into a failure mode in fact propagated the failure across the entire network.
It’s one thing to know the routine operational state of a software system. There are literally thousands (more likely millions) of people across the globe who understand the routine operation of the Linux kernel and the associated protocols, userspace daemons, and applications to a high degree of competency. I count myself as one of them for a number of subsystems.
I have great respect for the engineers I’ve worked with at MontaVista and elsewhere who understand their codebases so well that they can literally imagine the consequences of the unexpected situations that real code finds itself in and then correlate symptoms into root causes.
When the unexpected inevitably occurs you’ll want one of these people available to you. I’ve seen them make a difference. What’s your plan?
Brad


