Montavista


Archive for the 'evilbugs' Category

Odd failures are to be expected… what is your plan?

Sunday, July 27th, 2008

Last weekend the Amazon Web Services Simple Storage Service (S3) was down for a number of hours. The outage affected a huge number of startups and established companies that rely upon S3 for the operation of their systems.

Systems fail… they always do. This isn’t a screed declaring that cloud services are unreliable or that Amazon ought to be excoriated for being fallible. Failures happen to all systems. Those who strive for faultless operation are often the most disappointed when the inevitable occurs. When customers are on their backs, companies often lean quite heavily on their partners to help them find answers.

The public results of the root cause analysis have been posted. The short of it:

…we found that there were a handful of [system status] messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect.

The details are a bit terse, but one can surmise that the monitoring and failure analysis mechanisms used to manage the system were themselves what caused the failure. As they put it, “…when the corruption occurred, we didn’t detect it and it spread throughout the system causing the [failure]…”
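To make the "single bit corrupted but still intelligible" failure concrete, here is a minimal sketch of one common defense: attaching a checksum to each status message and dropping any message that fails verification. This is an illustration only, not a description of how S3 works or of what Amazon changed. The message contents and the helper names `seal`, `open_sealed`, and `flip_bit` are hypothetical; the only library call used is `zlib.crc32` from the Python standard library.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 of the payload (hypothetical wire format)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def open_sealed(message: bytes) -> bytes:
    """Return the payload if its CRC32 matches, otherwise raise ValueError."""
    if len(message) < 4:
        raise ValueError("message too short to carry a checksum")
    payload, trailer = message[:-4], message[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        raise ValueError("checksum mismatch: dropping message")
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate corruption by inverting a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    # Made-up status message, for illustration only.
    sealed = seal(b"node-17 state=healthy")
    print(open_sealed(sealed))

    try:
        open_sealed(flip_bit(sealed, 3))
    except ValueError as err:
        # The receiver refuses to act on, or forward, the bad message.
        print("rejected:", err)
```

A CRC catches every single-bit error, so a receiver that verifies before acting has a chance to stop a corrupted message at the first hop instead of passing it along.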

This reminded me of the 1990 AT&T Network “crash”. The entry of one system into a failure mode in fact propagated the failure across the entire network.

It’s one thing to know the routine operational state of a software system. There are literally thousands (more likely millions) of people across the globe who understand the routine operation of the Linux kernel and the associated protocols, userspace daemons, and applications to a high degree of competency. I count myself as one of them for a number of subsystems.

I have great respect for the engineers I’ve worked with at MontaVista and elsewhere who understand their codebases so well that they can literally imagine the consequences of the unexpected situations that real code finds itself in, and then correlate the symptoms into root causes.

When the unexpected inevitably occurs you’ll want one of these people available to you. I’ve seen them make a difference. What’s your plan?

Brad

Vile defect most evil!

Monday, May 19th, 2008

[ This is the first in a series on some of the most unusual, challenging, or just plain odd support engagements I’ve observed over the years. Names and companies have been changed to protect the innocent. ]

CASE #1-M6P7/1771

LOCATION: New Jersey

(dun-DUH)

The customer reported a periodic hang when booting their system. The only direct symptom was that the last kernel message displayed was: “Starting kswapd v1.8”. The behavior seemed to be a heisenbug, since the hang occurred only about once every 10 boots. Some hardware hung more, some hung less. At times hitting the physical reset button recovered the system. One particular hardware instance would hard-lock, resisting all recovery methods unless the system was hard power cycled by removing the AC plug.

Uh oh… that’s not good. Was the software mucking the hardware up so much that a cold power cycle was the only thing that would fix it?

The customer jumped into a range of experiments, adding and removing various hardware components and altering the amount of “power-off rest time” between test cycles. The results were conclusive: no one thing made the situation better. Tweak the GigE, SMP, RT scheduler, boot device… nothing mattered.

On the MontaVista side we had been discussing this issue in depth with the customer all along and suggesting various tests to try. We also confirmed that we couldn’t reproduce the issue on the hardware that we had in house. This was an issue confined to the specific brand of x86 motherboard that the customer had purchased. We were concerned because, while we sometimes see odd hardware attributes on custom hardware, hardware weirdness is far less frequent when a commercial product is used.

Most of our customers are using hardware that MontaVista doesn’t have in our lab. That’s the nature of the embedded software industry. This imposes some natural constraints… we can’t kill defects that we can’t reproduce or observe reliably. Sometimes the only thing that can help is to get the customer’s hardware in-house.

FedEx to the rescue. What happened next was quite a surprise to me.

(dun-DUH)

TO BE CONTINUED.

©2008 MontaVista Software, Inc. All Rights Reserved