Vile defect most evil, the conclusion.
May 23rd, 2008
When we last left off, our intrepid support team was just getting the FedEx package with the customer’s hardware in-house. At this point, as the customer’s FAE, I’m pretty sure that the customer is going to rip me a new one any moment.
Thankfully (I appreciate it Geoff) he didn’t do that.
Since the error occurred during reboot, we were using some scripting to automatically race through the cycles. The kernel has a sequence it uses in booting up, and not all of it actually produces printf output. You can also have issues where a printf has been sent but not displayed. Adding printf instrumentation can be helpful… but it only gives you certain information. Some of you out there are thinking JTAG… but this was x86. Sorry.
The approach you take on this kind of defect is to start simplifying the system. The customer had done some of this, but we wanted to quickly duplicate his progress to make sure it had been accurately conveyed. Then you might also start disabling BIOS features. Our support engineer soon noticed a correlation between the hang and activity on the parallel port’s I/O ports.
With a solid lead in hand, the engineer took to just abusing the parallel port directly using a userspace test application. This particular system, as it turns out, had a hardware fault. Reading repeatedly from port 0x3BC would, within minutes, cause the entire hardware platform to lock up.
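For the curious, here is a rough sketch of what a userspace port-abuse test along these lines might look like. This is not the engineer's actual program, which I don't have; it is a minimal Python illustration that assumes a Linux box where `/dev/port` is available and that you are running as root. The names `dev_port_reader` and `hammer_port`, the iteration count, and the progress interval are all made up for the example.

```python
import os

PORT = 0x3BC  # parallel-port I/O address named in the post


def dev_port_reader(path="/dev/port"):
    """Return a read_byte(port) function backed by Linux's /dev/port (needs root)."""
    fd = os.open(path, os.O_RDONLY)

    def read_byte(port):
        # In /dev/port the file offset is the I/O port number.
        return os.pread(fd, 1, port)[0]

    return read_byte


def hammer_port(read_byte, port=PORT, iterations=1_000_000, report_every=100_000):
    """Read `port` repeatedly; return the number of reads completed."""
    done = 0
    for _ in range(iterations):
        read_byte(port)
        done += 1
        if done % report_every == 0:
            # Flush so the last count survives on the console if the box locks up.
            print(f"{done} reads of port {port:#x}", flush=True)
    return done
```

To try it on real hardware you would call something like `hammer_port(dev_port_reader(), iterations=10**9)` as root and watch for the progress lines to stop. The reader is passed in as a function so the loop can be exercised with a fake reader on a machine you would rather not hang.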
Next, the engineer wrote an x86 assembly program that booted from a floppy and abused the port as described. With no Linux in the picture at all, this independently proved that the fault was in the hardware, not Linux.
I’ll probably always remember this issue because it wasn’t until right at the end that we became convinced that it wasn’t a software issue. We always suspected… but could never confirm.
Next debug horror story is about IDE drives.
Thanks for reading,
Brad


