Why We Are Moving Away from Xen and Hypervisors for Safety - Episode 1
Author: Corey Minyard
Introduction
A short while ago MV gave a talk saying we were exploring hypervisors like Xen for safety-critical systems. The theory was that a hypervisor is a smaller and easier-to-certify piece of software; you could build safety-critical applications running bare on the hypervisor, and still have Linux on the side. That was only exploration, and we are no longer moving in that direction.
We have a number of reasons, and I will talk a little more about why I think this is a bad idea.
Math
Basically, it all boils down to probability. From an availability point of view, we have many customers who meet 5 9's of availability, or about 5 minutes of downtime a year. Our Linux has even been part of systems that met 6 9's of availability (30 seconds of downtime a year), proven in the field, mostly through the crazy diligence of the customer. These types of things require a failure probability on the order of 10^-4 per hour. Availability targets are generally met with redundancy, so a race condition that takes down one system is unlikely to strike the redundant system at the same time. Linux is quite suitable for these sorts of things.
Safety is a completely different ballgame. 10^-5 failures per hour will only get you to ASIL-A, where Linux might be suitable. And "might" should be emphasized; I believe even 10^-5 is a stretch for Linux. That's one failure every 100,000 hours, or about one in 11.5 years. But nobody seems to want ASIL-A. Getting to ASIL-B requires on the order of 10^-6 failures per hour, or one failure in about 115 years. That's not something I would be confident about for Linux.
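To make those rates concrete, here is a quick back-of-the-envelope sketch in Python (purely illustrative) that converts availability figures into downtime per year, and per-hour failure rates into mean time to failure:

```python
HOURS_PER_YEAR = 8766  # average calendar year, including leap years

def downtime_minutes_per_year(nines):
    """Yearly downtime, in minutes, for an availability of N nines."""
    availability = 1 - 10 ** -nines
    return (1 - availability) * HOURS_PER_YEAR * 60

def mttf_years(failure_rate_per_hour):
    """Mean time to failure, in years, for a constant per-hour failure rate."""
    return (1 / failure_rate_per_hour) / HOURS_PER_YEAR

print(downtime_minutes_per_year(5))  # ~5.3 minutes/year (5 9's)
print(downtime_minutes_per_year(6))  # ~0.5 minutes/year, about 30 seconds (6 9's)
print(mttf_years(1e-5))              # ~11.4 years between failures
print(mttf_years(1e-6))              # ~114 years between failures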
These numbers need some explanation. Highly available systems are rated based upon MTTF (Mean Time To Failure) and MTTR (Mean Time To Repair). So if you have a system with a failure rate of 2x10^-4 per hour (an MTTF of 5x10^3 hours) and an MTTR of 5 hours, you get an availability of A = MTTF/(MTTF + MTTR) = 5000/(5000 + 5) = .999, or 3 9's. (Yes, I know, I should be using MTBF in these equations, but the difference is insignificant for these calculations.) Putting two of these systems together into a redundant pair, the availability is basically bounded by the chance that both systems are down at the same time: At = 1 - Πn(1 - An) = 1 - (1 - .999)(1 - .999) = .999999, or almost 6 9's. But you have to include maintenance times when the system is simplex (only one system is running) and things of that nature, so it will generally be less than that, probably closer to 5 nines. And in these sorts of systems, a failure generally doesn't result in people dying. You can't get to your web page or you can't make a phone call. The probability of a failure coinciding with an event that would require a phone call to prevent a death is very low.
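Here is the same arithmetic as a small Python sketch (illustrative only; the simplex maintenance windows mentioned above are not modeled):

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability of a single system."""
    return mttf_hours / (mttf_hours + mttr_hours)

def redundant_availability(*availabilities):
    """Availability of N redundant systems: up unless all are down at once."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1 - a)
    return 1 - all_down

a = availability(5000, 5)            # 0.999, i.e. 3 9's
print(a)
print(redundant_availability(a, a))  # ~0.999999, almost 6 9's
```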
The safety requirements may seem extreme, but put in perspective, they are not. According to the AAA, people in the US drive an average of 293.3 hours a year, and the population of the US was 327,096,265 in 2018. That's roughly 96 billion hours of driving. If your failure rate was 10^-6 per hour and one in a hundred failures resulted in a death (that's just a wild guess), you get about 1000 deaths a year. These are rough numbers, obviously, but they give a perspective on the size of the problem and why the reliability rates on safety-critical systems need to be so high. 1000 deaths per year is way better than humans: there were 36,560 traffic deaths in the US in 2018. But it's still a big number, and it does not include all the people who are not killed but are significantly affected by an accident. Car makers are not going to want to take on that level of liability.
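The same estimate as a Python sketch, using the figures from the text (the one-in-a-hundred fatality ratio is the wild guess noted above):

```python
# Figures from the text: AAA average driving hours and 2018 US population.
hours_per_driver_per_year = 293.3
us_population_2018 = 327_096_265

total_driving_hours = hours_per_driver_per_year * us_population_2018
print(total_driving_hours / 1e9)  # ~96 billion hours of driving per year

failure_rate_per_hour = 1e-6      # roughly ASIL-B territory
deaths_per_failure = 1 / 100      # wild guess: 1 in 100 failures is fatal

print(total_driving_hours * failure_rate_per_hour * deaths_per_failure)
# ~960, i.e. about 1000 deaths a year
```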
Back to Linux
So while I believe meeting 10^-4, or perhaps even 10^-5, could be achievable with Linux using current techniques, meeting the level of reliability required for safety-critical systems, 10^-6 to 10^-8 failures per hour, is not.
I don’t think it’s a matter of reviewing harder or longer, or of static analysis, or anything like that. Linux is an amazing piece of software with reliability well above average, and with good systems and procedures around it. But there are fundamental issues with meeting those levels of reliability in multithreaded software. I’ll talk about why I believe this in my next post.