Why We Are Moving Away from Xen and Hypervisors for Safety - Episode 3
Author: Corey Minyard
The Kernel Conclusion
In my previous posts I talked about reliability and about some kernel bugs I have worked on. But what do they mean?
Note that none of these bugs had reproducers the customer could give us. They only occurred in full lab setups or, worse, at customer sites. For the most part our customers are serious about quality and reliability; they generally have kernel coredumps set up and mechanisms to record crash information; and they test heavily. So they push the systems to their limits and are able to extract the necessary information in the event of a failure. We test, too, and the kernel receives tons of testing from many different sources. And still, the bugs persist.
These bugs are only examples, but they are representative of the range of bugs I have worked on. So:
- If the customer adds code, it will almost certainly have bugs. It will not have been vetted through the mainstream kernel process and, unless significant measures are taken, it will be of worse quality than the mainstream kernel code.
- Firmware is another source of hazards. Some systems can run without firmware after boot, though the firmware is still critical because it boots the system. Other systems have firmware that runs all the time. The firmware is just as important as the software, and is often closed source. I’ve seen a lot of firmware code, and, well, it does not inspire confidence in me.
- The kernel is full of race conditions. They are found all the time. These sorts of bugs are often very hard to see, and no current analysis tool can handle something the size of the kernel. They occur rarely, and when they do occur, tracing them to root cause is hard. I have seen several instances where a bug was fixed by later patches, but those were general quality-improvement patches whose authors did not even know they were fixing the bug. (A toy sketch of this kind of race follows this list.)
- There are race-condition-like bugs that involve the ordering of memory operations in optimizing compilers and superscalar processors. (Search for memory barriers if you have questions.) Getting the proper barriers in place can be very difficult. Fortunately, most Linux code relies on mutexes and locks for ordering, but these bugs are still probably lurking in the kernel, and they are even harder to find than ordinary race conditions. And then there’s RCU, which is also very hard and heavily used in the kernel because you can’t scale without it. (Search for Linux RCU if you have questions. RCU can be a bit hard to understand at first.)
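To make the shape of these bugs concrete, here is a minimal userspace sketch in plain C with pthreads. It is not kernel code, and the names (struct item, reader, writer, shared) are made up for illustration; it just mimics the use-after-free race pattern described above. The reader never takes the lock, so the writer can free an object the reader is still dereferencing. The window is tiny, which is why a bug like this can pass thousands of test runs before it bites.

```c
/*
 * Toy illustration of a use-after-free race: the reader dereferences a
 * shared pointer without any synchronization while the writer swaps it
 * out and frees the old object.  It may run cleanly many times before
 * it crashes, which is exactly why such bugs survive heavy testing.
 */
#include <pthread.h>
#include <stdlib.h>

struct item {
	int value;
};

static struct item *shared;	/* shared pointer, read without a lock */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *reader(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++) {
		struct item *p = shared;	/* racy: no lock, no RCU */
		if (p)
			(void)p->value;		/* may be a use-after-free */
	}
	return NULL;
}

static void *writer(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++) {
		struct item *next = malloc(sizeof(*next));

		if (!next)
			break;
		next->value = i;

		pthread_mutex_lock(&lock);
		struct item *old = shared;
		shared = next;
		pthread_mutex_unlock(&lock);

		free(old);	/* the reader may still be using 'old' */
	}
	return NULL;
}

int main(void)
{
	pthread_t r, w;

	pthread_create(&r, NULL, reader, NULL);
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(r, NULL);
	pthread_join(w, NULL);
	return 0;
}
```

In the kernel, the usual fix for this pattern is RCU: readers bracket the access with rcu_read_lock()/rcu_read_unlock() and the writer calls synchronize_rcu() before freeing the old object (in this userspace toy, having the reader take the mutex would do). The point is that the broken version compiles cleanly, looks plausible in review, and passes almost every test run.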
It is possible to manage customer-added code through quality processes.
Firmware is harder, but a very stripped-down bootloader that does not do anything after boot is probably manageable. Most firmware, though, introduces a huge black box into the execution of the system, more so now than in the past, with ACPI and EFI able to preempt the operation of the system outside the OS’s control. The real-time Linux group has identified firmware as a source of latency on some systems, where the firmware can run asynchronously with the OS. For safety, the firmware needs to meet the same stringent requirements as the rest of the system. EFI and ACPI are very large, rivaling even the kernel in size.
Kernel race conditions are not really manageable with our current capabilities. I wish I had hard numbers, but such things are very hard to come by. I can only give answers from my experience maintaining a Linux kernel.
In my opinion, getting Linux to 10^-6 will require tools that can find the race conditions and use-after-free scenarios that I discussed before. Until such tools exist, the Linux kernel by itself cannot meet the failure probabilities required for safety-critical systems. Even with those tools, getting to 10^-7 seems daunting. Getting to 10^-8 seems impossible to me for a preemptive system. It’s just too complicated. That’s one failure roughly every 11,400 years. I don’t even know how you could measure that. Maybe formal analysis techniques will someday scale far enough, but we are a long way from there.
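For concreteness, here is the back-of-the-envelope arithmetic behind that figure, assuming (as the number implies) that the 10^-8 target is a failure probability per hour of operation:

```latex
\text{MTBF} = \frac{1}{\lambda}
            = \frac{1}{10^{-8}\ \text{failures/hour}}
            = 10^{8}\ \text{hours}
            \approx \frac{10^{8}}{8766\ \text{hours/year}}
            \approx 11{,}400\ \text{years}
```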
And this is only the kernel. We also need to consider what we kernel engineers call “userland”. If safety-critical functions run there, the necessary libraries and tools in userland matter, too. That is more code than the kernel, and in many cases it gets less focus on quality. And then there’s the firmware. And then the hardware. Yes, things look pretty bad.
So What About Xen?
I have spent some time looking at Xen. It is a smaller piece of code than the kernel and somewhat less complex. It seems well written and well designed. But all the same problems still apply. It is still a multi-threaded preemptive system, and I’m sure it’s full of race conditions. Plus, something still has to manage all the devices that Xen does not manage but that you need. Certifying the Linux kernel for safety is like swimming the Pacific Ocean: it’s impossible. Certifying Xen would be like swimming the Atlantic Ocean: a lot smaller, but still impossible.
Any hypervisor is going to have the same issues. They do the same things.
We need something like swimming the English Channel. Sure, it’s hard, but it can be done.
So Do We Abandon Hope?
Maybe not. I’ll talk about some options in the next post.