Why We Are Moving Away from Xen and Hypervisors for Safety - Episode 2
Author: Corey Minyard
Reliability and Linux
In my previous post I talked about the magnitude of the reliability problem for safety critical software. In this post I’ll focus on bugs in Linux.
I work as the kernel architect for MontaVista, and along the way I have worked on a large number of difficult bugs in the kernel. I have worked on some easy ones, too, but those are mostly handled by others; I get them after others have been unable to make progress. I want to use some of these bugs to illustrate why I think the kernel, with current analysis techniques, will not be suitable for safety critical systems.
A Memory Trampler
The first bug I’ll talk about was a memory trampler. This is the only real memory trampler I have seen in the kernel; in my experience and observation they are very rare. They can be easily found with static analysis or review. These do not bother me from a safety point of view.
In this situation, a piece of the kernel’s page metadata (the struct page array) was being randomly overwritten. The corruption always landed in that array, so at least we had some consistency. I modified the kernel so the array was mapped read-only, then added code around every legitimate write to it that marked just the affected page writable for the duration of the write. After that rather difficult modification, some review, and some testing, I sent the patches to the customer. They applied the patches and caught the stray write that was causing the corruption. It turned out the problem was actually in a kernel module the customer had written themselves; the module had worked fine on the previous version of our product. I don’t have any root cause information, as the customer didn’t give us anything further after we pointed them at the offending function.
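To give a feel for the approach, a conceptual sketch (not the actual patches, and glossing over locking and the contexts in which page attributes can safely be changed) might look like this, assuming an architecture that provides set_memory_ro() and set_memory_rw():

/*
 * Sketch only: the whole struct page array is marked read-only at init
 * (e.g. set_memory_ro() over its range) so any stray store faults
 * immediately, and every legitimate update opens a brief write window.
 */
#include <linux/mm.h>
#include <asm/set_memory.h>

static void page_meta_write_begin(struct page *page)
{
	set_memory_rw((unsigned long)page & PAGE_MASK, 1);
}

static void page_meta_write_end(struct page *page)
{
	set_memory_ro((unsigned long)page & PAGE_MASK, 1);
}

/* Each legitimate update to the page metadata gets wrapped like this. */
static void set_page_private_checked(struct page *page, unsigned long priv)
{
	page_meta_write_begin(page);
	set_page_private(page, priv);
	page_meta_write_end(page);
}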
Even though this sort of bug doesn’t bother me much overall, this experience taught me that the kernel is not a place for people who don’t deeply understand it. As a kernel engineer, I take for granted things that are not at all obvious to others. Many of our customers modify the kernel for their own needs; such modifications come with significant risk.
A Firmware Bug
Not all bugs in the kernel have a kernel origin, strangely enough. A customer had Linux running on a card in their lab, and the card would often crash at boot time or shortly after. They were able to get a kernel core dump and send it to me.
The analysis for this one was really pretty easy. I looked at what was going on when the kernel crashed, and after some poking around I realized that the machine code being executed did not match what should have been in memory. I extracted the incorrect memory contents and sent them back to the customer, hoping they would mean something to them. The customer recognized their lab IP addresses in the dump, and we figured out that it was an ARP packet. It turned out that the firmware was not disabling the ethernet device before starting the kernel. If a packet was received before the kernel reset the ethernet device, the device would DMA the packet over kernel memory.
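For illustration, a defensive kernel-side mitigation for this class of problem might look like the following. This is a sketch, not what was done in this case: it assumes a PCI NIC, the vendor and device IDs are placeholders, and it only narrows the window rather than closing it, since a packet can still arrive before PCI enumeration runs.

/*
 * Early PCI fixup: stop any DMA the firmware left enabled before a
 * driver ever touches the device.  The IDs below are placeholders.
 */
#include <linux/pci.h>

static void quiesce_stale_nic(struct pci_dev *pdev)
{
	/* Disable bus mastering so a stray receive cannot DMA anywhere. */
	pci_clear_master(pdev);
}
DECLARE_PCI_FIXUP_EARLY(0x1234, 0x5678, quiesce_stale_nic);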
A Race Condition
I’m currently working on a bug identified by the test suite for gensio, a project I developed. I spent a lot of time trying to figure out what was wrong with my own code; after all, blaming the kernel is like blaming the compiler. You better be darn sure.
However, after racking my brain, I wrote a small reproducer, and sure enough, it was the kernel. If you write to the master side of a pty and then close it, the kernel will occasionally drop a chunk of data from the middle of the stream due to a subtle race condition. This has been in the kernel for a long time and nobody has noticed. The tty code is so complicated that the maintainers aren’t quite sure how to fix it.
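A minimal sketch of that kind of reproducer looks roughly like this. It is an illustration of the approach, not the actual gensio test: write a counting byte pattern to the master side of a pty, close the master, and verify on the slave side that every byte arrived in order. The chunk sizes and iteration count are arbitrary, and a race like this can take many runs to trip; a mid-stream gap in the pattern is the signature of the bug described above.

/* Sketch of a pty data-loss reproducer (illustrative only). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <termios.h>
#include <unistd.h>

#define CHUNK 1024
#define CHUNKS 64

static int write_all(int fd, const unsigned char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n <= 0)
			return -1;
		buf += n;
		len -= n;
	}
	return 0;
}

int main(void)
{
	for (int iter = 0; iter < 1000; iter++) {
		int master = posix_openpt(O_RDWR | O_NOCTTY);

		if (master < 0 || grantpt(master) || unlockpt(master)) {
			perror("pty setup");
			return 1;
		}
		int slave = open(ptsname(master), O_RDWR | O_NOCTTY);
		if (slave < 0) {
			perror("open slave");
			return 1;
		}

		/* Raw mode: no echo or special-character processing. */
		struct termios tio;
		tcgetattr(slave, &tio);
		cfmakeraw(&tio);
		tcsetattr(slave, TCSANOW, &tio);

		pid_t pid = fork();
		if (pid == 0) {
			/* Child: write the pattern, then close the master. */
			close(slave);
			unsigned char buf[CHUNK], c = 0;

			for (int i = 0; i < CHUNKS; i++) {
				for (int j = 0; j < CHUNK; j++)
					buf[j] = c++;
				if (write_all(master, buf, CHUNK))
					_exit(1);
			}
			close(master);	/* Races with data still in flight. */
			_exit(0);
		}

		/* Parent: read from the slave, checking the byte sequence. */
		close(master);
		unsigned char buf[CHUNK], expect = 0;
		long total = 0;
		ssize_t n;

		while ((n = read(slave, buf, sizeof(buf))) > 0) {
			for (ssize_t i = 0; i < n; i++, expect++) {
				if (buf[i] != expect) {
					printf("iteration %d: gap at byte %ld\n",
					       iter, total + i);
					return 1;
				}
			}
			total += n;
		}
		close(slave);
		wait(NULL);
	}
	printf("no mid-stream data loss observed\n");
	return 0;
}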
A Use After Free Bug
For the last illustration, I’ll talk about a bug I worked on recently. The customer was really beating on the network neighbor code, the generic code that handles ARP and things of that nature. Every once in a while the kernel would crash, usually in the timer code or something related to it. The timer data appeared to be totally bogus, and after some analysis and debugging patches the evidence pointed to a use after free. The data in the timer had been destroyed, so we couldn’t tell where the timer came from. At the time we didn’t even know it had anything to do with the neighbor code; we just knew that something related to timers was crashing.
To trace this one down, I wrote some code that kept track of all running timers in a data structure, then added code to the memory free routines to panic if a known running timer was inside the chunk of memory being freed.
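A sketch of that kind of instrumentation looks like the following. This is illustrative only, not the actual debug patches: the helper names are made up, and the real code has to be wired directly into the add_timer()/mod_timer()/del_timer() paths and the slab free paths.

/*
 * Remember every running timer; panic if a chunk being freed still
 * contains one.
 */
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/timer.h>

struct tracked_timer {
	struct list_head link;
	struct timer_list *timer;
};

static LIST_HEAD(tracked_timers);
static DEFINE_SPINLOCK(tracked_lock);

/* Hooked into the timer start paths. */
void debug_timer_started(struct timer_list *timer)
{
	struct tracked_timer *t = kmalloc(sizeof(*t), GFP_ATOMIC);
	unsigned long flags;

	if (!t)
		return;
	t->timer = timer;
	spin_lock_irqsave(&tracked_lock, flags);
	list_add(&t->link, &tracked_timers);
	spin_unlock_irqrestore(&tracked_lock, flags);
}

/* Hooked into the timer stop and expiry paths. */
void debug_timer_stopped(struct timer_list *timer)
{
	struct tracked_timer *t;
	unsigned long flags;

	spin_lock_irqsave(&tracked_lock, flags);
	list_for_each_entry(t, &tracked_timers, link) {
		if (t->timer == timer) {
			list_del(&t->link);
			kfree(t);
			break;
		}
	}
	spin_unlock_irqrestore(&tracked_lock, flags);
}

/* Hooked into the memory free routines with the chunk being released. */
void debug_check_free(const void *addr, size_t size)
{
	unsigned long start = (unsigned long)addr, end = start + size;
	struct tracked_timer *t;
	unsigned long flags;

	spin_lock_irqsave(&tracked_lock, flags);
	list_for_each_entry(t, &tracked_timers, link) {
		unsigned long p = (unsigned long)t->timer;

		if (p >= start && p < end)
			panic("freeing memory containing running timer %lx", p);
	}
	spin_unlock_irqrestore(&tracked_lock, flags);
}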
Then, of course, the problem stopped happening. A Heisenbug. I assume the extra time spent in the free code shifted the timing enough to mask the problem. The customer left the debug code in; it was efficient enough not to affect the operation of the system. Months later, it finally tripped. We believe the problem was fixed by a later kernel patch (this was quite recent, so we aren’t 100% sure yet), but the patch’s header made no mention of this type of race.
So What?
In my next post I’ll talk about why I think these bugs are illustrative.